Skull-stripping, also known as brain extraction, involves the removal of non-brain tissue signal from magnetic resonance imaging (MRI) data. This process is useful for anonymizing brain scans and is a fundamental component of many neuroimage analysis pipelines, such as FreeSurfer (Fischl, 2012), FSL (Jenkinson et al., 2012), AFNI (Cox, 1996), and ANTs (Avants et al., 2011). These packages include tools that typically require brain-extracted input images and might perform inaccurately, or even fail, without removal of irrelevant and distracting tissue. One class of algorithms that benefits from this systematic tissue extraction is image registration, a core element of atlas-based segmentation and other analyses. Nonlinear registration (Ashburner, 2007; Avants et al., 2008; Rueckert et al., 1999; Vercauteren et al., 2009) estimates local deformations between pairs of images, and these algorithms tend to produce more accurate estimates when they can focus entirely on the anatomy of interest (Klein et al., 2009; Ou et al., 2014). Similarly, skull-stripping increases the reliability of linear registration (Cox and Jesmanowicz, 1999; Friston et al., 1995; Hoffmann et al., 2015; Jenkinson and Smith, 2001; Jiang et al., 1995; Modat et al., 2014; Reuter et al., 2010) by excluding anatomy that deforms non-rigidly, such as the eyes, jaw, and tongue (Andrade et al., 2018; Fein et al., 2006; Fischmeister et al., 2013; Hoffmann et al., 2020).
Classical skull-stripping techniques are well-explored and widespread, but popular methods are often tailored to images with specific modalities or acquisition properties. Most commonly, these methods focus on three-dimensional (3D) T1-weighted (T1w) MRI scans acquired with MPRAGE sequences (Marques et al., 2010; Mugler III and Brookeman, 1990; van der Kouwe et al., 2008), which are ubiquitous in neuroimaging research. While some skull-stripping tools accommodate additional contrasts, these methods are ultimately limited to a predefined set of viable image types and do not properly adapt to inputs outside this set. For example, skull-stripping tools developed for near-isotropic, adult brain images may perform poorly when applied to infant subjects or clinical scans with thick slices, such as stacks of 2D fast spin-echo (FSE) acquisitions.
When a suitable brain extraction method is not available for a particular scan type, a common workaround involves skull-stripping a compatible image of the same subject and computing a co-registration to propagate the extracted brain mask to the target image of interest (Iglesias et al., 2011). Unfortunately, an accurate intra-subject alignment can require significant manual tuning because the target image still includes extra-cerebral matter that may impede linear registration quality (Reuter et al., 2010). Crucially, this procedure also requires the existence of an additional, strip-able image, often a high-resolution isotropic T1w or T2-weighted (T2w) scan, which is rare in clinical screening protocols and introduces a barrier to the clinical adoption of analysis tools.
While classical algorithms for skull-stripping are limited by their assumptions about the spatial features and intensity distributions in the input images, supervised deep-learning approaches, which leverage convolutional neural networks (CNNs), can, in principle, learn to extract a region of interest from any image type given sufficient anatomical contrast and resolution. In practice, these networks achieve high accuracy for data types observed during training, but their performance often deteriorates on images with characteristics unseen during training (Hendrycks et al., 2021; Hoffmann et al., 2021b; Jog et al., 2019; Karani et al., 2018). In consequence, robust, supervised learning-based approaches depend on the availability of a large training dataset that contains accurate ground-truth annotations and exposes the network to a landscape of image types. While numerous public datasets provide access to popular MRI acquisitions for which target brain masks can be easily derived with classical methods, curating a diverse training dataset with uncommon sequences and sufficient anatomical variability is a challenging task that requires substantial human effort. As a result, current deep-learning skull-stripping methods are trained with few different data types and deliver state-of-the-art results only for particular subsets of image characteristics (Hwang et al., 2019; Kleesiek et al., 2016; Salehi et al., 2017).
Recently, a novel learning strategy breaks the dependency on data availability by training networks with a wide array of synthetic images, each generated directly from a precomputed label map. This synthesis scheme enables networks to accurately carry out tasks on any image type at evaluation-time without ever sampling real target acquisitions during training, and it has been effectively employed for segmentation (Billot et al., 2020) and deformable image registration (Hoffmann et al., 2021b). To build on deep-learning methods for brain extraction while addressing their shortcomings, we adapt the synthesis technique and introduce SynthStrip, a flexible brain-extraction tool that can be deployed universally on a variety of brain images. By exposing a CNN to an arbitrary and deliberately unrealistic range of anatomies, contrasts, and artifacts, we obtain a model that is agnostic to acquisition specifics, as it never samples any real data during training. Consequently, this scheme enables SynthStrip to extract the brain from neuroimaging data of any kind, and we demonstrate its viability and improvement over popular baselines using a varied test set that is representative of many research scans and clinical exams (Figure 1). The test set includes T1w, T2w, T2w fluid attenuated inversion recovery (T2-FLAIR), and proton-density (PDw) contrasts as well as clinical FSE scans with thick slices and high in-plane resolution, and low-resolution EPI, ranging across age and pathology. We demonstrate the ability of SynthStrip to generalize beyond structural MRI, to MR angiography (MRA), diffusion-weighted imaging (DWI), positron emission tomography (PET), and even computed tomography (CT). We make our validation set publicly available to promote further development and evaluation of brain extraction tools.
2 Related Work
In this section, we briefly review a variety of automated techniques for brain extraction introduced over the last two decades, focusing in particular on those with high efficacy and popularity in the research domain. For an exhaustive overview of skull-stripping methods, see Fatima et al. (2020).
2.1 Classical Skull-Stripping
Classical, or traditional, algorithms that remove non-brain image signal are well-refined and vary substantially in their implementation. One common class of approaches leverages a deformable mesh model to reconstruct a smooth boundary of the brain matter surface. The widely-used Brain Extraction Tool (BET) (Smith, 2002), distributed as part of the FSL package (Jenkinson et al., 2012), utilizes this technique by initializing a spherical surface mesh at the barycenter of the brain and projecting mesh vertices outwards to model the brain border. Since BET uses locally adaptive intensity thresholds to distinguish brain and non-brain voxels, it generalizes well across a variety of image contrasts, such as T1w, T2w, and PDw. To further prevent surface leaks beyond the brain boundary, 3dSkullStrip, a component of AFNI (Cox, 1996), extends the BET strategy by considering information on the surface exterior, accounting for eyes, ventricles, and skull.
The popular hybrid approach (Ségonne et al., 2004) available in FreeSurfer also leverages a deformable surface paradigm, combining it with a watershed algorithm and statistical atlas to improve general robustness. First, the watershed establishes an estimate of the brain white-matter mask, which is then refined to the brain boundary using a surface mesh expansion. A probabilistic atlas of intensity distributions helps prevent outliers during mesh fitting, and erroneous brain mask voxels are further removed during post-processing via a graph cuts algorithm (Greig et al., 1989; Sadananthan et al., 2010) that thresholds the cerebrospinal fluid (CSF). While effective, this technique is optimized only for images with T1w contrast, since it relies on the underlying assumption that white matter is surrounded by hypointense gray matter and CSF regions.
Another hybrid approach, ROBEX (Iglesias et al., 2011), exploits a joint generative and discriminative model to segment the brain mask. A Random Forest (Breiman, 2001) classification detects the brain contour, which is used to fit a point-distribution model to the brain target. The skull-stripping tool BEaST (Eskildsen et al., 2012) builds on patch-based, non-local segmentation (Coupé et al., 2010, 2011) techniques and assigns a label to each voxel by comparing its local neighborhood to patches in a reference set with prior labels. Brain Surface Extraction (BSE) (Shattuck et al., 2001) uses a combination of anisotropic diffusion filtering, edge-detection, and morphological operations but is sensitive to parameterization and dependent on high-quality data. With the exception of BET and 3dSkullStrip, all of these tools were specifically developed for T1w images.
2.2 Deep-Learning Approaches
Innovations in deep-learning have gained popularity as methodological building blocks for an array of tasks in medical image analysis, including skull-stripping. Various learning-based extraction methods have been proposed, demonstrating accuracy and speed that often outperform their classical counterparts. These models are optimized in a supervised fashion, using a set of acquired training images with corresponding ground-truth brain masks, derived through classical methods or manual segmentation. An early, cross-contrast approach, Deep MRI Brain Extraction (DMBE) (Kleesiek et al., 2016), trains a 3D CNN on combinations of T1w, T2w, and FLAIR contrasts and matches the accuracy of classical baselines for several datasets, including clinical scans with brain tumors. Conversely, Auto-Net (Salehi et al., 2017) introduces two separate 2.5D architectures that skull-strip volumes by individually segmenting sagittal, coronal, and transverse views of the same image and fusing the predictions with an auto-context algorithm (Tu and Bai, 2009). The first architecture leverages convolutions on single-resolution voxel-wise patches, while the second utilizes a scale-space U-Net (Ronneberger et al., 2015) architecture to predict the brain mask. Auto-Net was shown to be effective for both adult and neonatal brain scans, but was only trained with T1w images. More recently, Hwang et al. (2019) implement a full 3D U-Net that robustly matches or exceeds state-of-the-art brain extraction performance on T1w scans.
To predict robust brain masks for an array of real image types, we train a deep convolutional neural network on a vast landscape of images synthesized with a deliberately unrealistic range of anatomies, acquisition parameters, and artifacts. From a dataset $\mathcal{D}$ of precomputed, whole-head segmentations with brain and non-brain tissue labels, we sample a segmentation $s$ at each optimization step and use it to generate a gray-scale head scan $x$ with randomized acquisition characteristics. In effect, this paradigm yields an endless stream of arbitrary images that we use to optimize a SynthStrip network $g_\theta$, with trainable parameters $\theta$, in a supervised fashion:
$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{s \in \mathcal{D}} \left[ \mathcal{L}\left(\hat{y}, y\right) \right],$$
where $\hat{y} = g_\theta(x)$ is the predicted brain mask, $y$ is the target brain mask derived by merging the brain labels of $s$, and $\mathcal{L}$ is the loss function that measures similarity between $\hat{y}$ and $y$.
Building from previous work (Billot et al., 2020; Hoffmann et al., 2021b), we use a generative model to synthesize a stream of random images with substantial anatomical and intensity variation, as exhibited in Figure 2. At each training step, parameters that dictate synthesis components are randomly sampled from predetermined ranges and probability values explicitly defined in Table 1. We emphasize that while the generated scans can appear implausible, these training images do not need to be realistic in order for the SynthStrip model to accurately generalize to real images at test-time.
To generate a gray-scale image $x$ from a whole-head anatomical segmentation $s$, we first create spatial variability to subject the network to a landscape of possible head positions and anatomical irregularities. This is accomplished by manipulating $s$ with a spatial transformation $\phi$, composed of an affine transform (with random translation, scaling, and rotation) and a nonlinear deformation. The deformation is generated by sampling random 3D displacement vectors from a normal distribution, with random scale, at an arbitrarily low image resolution. This random displacement field is vector-integrated, using five scaling and squaring steps to encourage a diffeomorphic warp (Arsigny et al., 2006; Dalca et al., 2019), and tri-linearly resampled to match the resolution of $s$. After applying the randomized transform, the resulting segmentation $s'$ serves as the basis for deriving the image $x$ and target brain mask $y$, which is obtained by merging the labels of $s'$ into brain and non-brain classes.
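The nonlinear deformation step can be illustrated with a small NumPy sketch. This is a simplified 2D illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `integrate_velocity`, `random_warp`), the coarse grid size, and the unit conventions are hypothetical, and bilinear sampling in 2D stands in for the trilinear 3D resampling described in the text.

```python
import numpy as np

def bilinear_sample(field, coords):
    """Sample an (H, W, C) field at float (row, col) coordinates, clamping at the border."""
    h, w = field.shape[:2]
    r = np.clip(coords[..., 0], 0, h - 1)
    c = np.clip(coords[..., 1], 0, w - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, h - 1), np.minimum(c0 + 1, w - 1)
    fr, fc = (r - r0)[..., None], (c - c0)[..., None]
    top = field[r0, c0] * (1 - fc) + field[r0, c1] * fc
    bottom = field[r1, c0] * (1 - fc) + field[r1, c1] * fc
    return top * (1 - fr) + bottom * fr

def identity_grid(h, w):
    """Return an (H, W, 2) array holding the (row, col) coordinate of every pixel."""
    return np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1).astype(float)

def integrate_velocity(velocity, steps=5):
    """Scaling and squaring: scale the field down by 2**steps, then compose it with itself `steps` times."""
    h, w = velocity.shape[:2]
    grid = identity_grid(h, w)
    disp = velocity / (2 ** steps)
    for _ in range(steps):
        # Composition of the warp with itself: u(x) <- u(x) + u(x + u(x)).
        disp = disp + bilinear_sample(disp, grid + disp)
    return disp

def random_warp(shape, coarse_shape=(4, 4), sd=1.0, steps=5, rng=None):
    """Sample a coarse random field, integrate it, and upsample it to `shape`."""
    rng = np.random.default_rng() if rng is None else rng
    velocity = rng.normal(0.0, sd, size=(*coarse_shape, 2))  # coarse-voxel units (assumption)
    coarse_disp = integrate_velocity(velocity, steps)
    rows = np.linspace(0, coarse_shape[0] - 1, shape[0])
    cols = np.linspace(0, coarse_shape[1] - 1, shape[1])
    coords = np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1)
    # Convert displacements from coarse-voxel units to full-resolution voxel units.
    scale = np.array(shape) / np.array(coarse_shape)
    return bilinear_sample(coarse_disp, coords) * scale
```

Integrating at the coarse resolution before upsampling keeps the composition cheap, which is one plausible reason for sampling the field at low resolution in the first place.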
To synthesize the image $x$ from the transformed segmentation $s'$, we consider a Bayesian model of MR contrast, which assumes that the voxel intensity of each tissue type in the image can be represented by a single Gaussian distribution. Reversing this generalization, we assign a random distribution of tissue intensity to every anatomical label in $s'$ and use this artificial mixture model to attain an image with arbitrary contrast by replacing each label voxel in $s'$ with a random value drawn from its corresponding intensity distribution. Following the synthesis, we aim to simulate various artifacts and geometric properties that might exist across modality and acquisition type. First, we corrupt the image with a spatially-varying intensity bias field, generated by resizing a low-resolution image sampled from a normal distribution with zero mean. The corrupted image is computed by an element-wise multiplication with the voxel-wise exponential of the bias field. Second, we perform gamma augmentation by globally exponentiating all voxel values with a single normally sampled parameter. Lastly, to account for scans with a partial field of view (FOV) and varied resolution, we randomly crop the image content and down-sample along an indiscriminate set of axes.
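The intensity synthesis described above can be sketched as follows. This is a minimal illustration rather than the published generator: the function name `synthesize_image`, the default values of `bias_sd`, `bias_shape`, and `gamma_sd`, the nearest-neighbour resizing of the bias field, the rescaling to [0, 1] before exponentiation, and the use of exp(normal sample) as the gamma exponent are all assumptions made for brevity. Cropping and down-sampling are omitted.

```python
import numpy as np

def synthesize_image(labels, rng, mean_range=(0.0, 1.0), sd_range=(0.0, 0.1),
                     bias_sd=0.3, bias_shape=(4, 4, 4), gamma_sd=0.25):
    """Generate a random-contrast image from an integer label map."""
    labels = np.asarray(labels)
    n_labels = int(labels.max()) + 1

    # One random Gaussian intensity distribution per anatomical label.
    means = rng.uniform(*mean_range, size=n_labels)
    sds = rng.uniform(*sd_range, size=n_labels)
    image = rng.normal(means[labels], sds[labels])

    # Bias field: low-resolution zero-mean noise, resized to the image shape
    # (nearest-neighbour here for brevity), exponentiated, and multiplied in.
    coarse = rng.normal(0.0, bias_sd, size=bias_shape)
    index = [np.arange(n) * c // n for n, c in zip(labels.shape, bias_shape)]
    bias = coarse[np.ix_(*index)]
    image = image * np.exp(bias)

    # Rescale to [0, 1] so that the global exponentiation is well defined.
    image = (image - image.min()) / max(image.max() - image.min(), 1e-8)

    # Gamma augmentation with a single, normally sampled parameter.
    gamma = np.exp(rng.normal(0.0, gamma_sd))
    return image ** gamma
```

The target brain mask for the same sample would be derived from the label map itself, for example with `np.isin(labels, brain_label_ids)`, where `brain_label_ids` is a hypothetical list of the brain-specific label values.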
|Synthesis hyperparameter|Uniform sampling range|
|Affine translation|0–50 mm|
|Deformation voxel length|8–16 mm|
|Deformation SD|0–3 mm|
|Label intensity mean|0–1|
|Label intensity SD|0–0.1|
|Bias field voxel length|4–64 mm|
|Bias field SD|0–0.5|
|FOV cropping (any axis)|0–50 mm|
Table 1: Uniform hyperparameter sampling ranges used for synthesizing a training image from a source segmentation map. The specific values were chosen by visual inspection of the generated images, to produce a landscape of image contrasts, anatomies, and acquisition characteristics that far exceed the realistic range of medical images. We sample fields with isotropic voxels of the indicated side length, where SD abbreviates standard deviation.
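As a small illustration of how the ranges listed in the table could drive the generator, the sketch below draws one value per hyperparameter at every training step. The dictionary keys and the function name `sample_hyperparameters` are hypothetical; only the numeric ranges come from the table.

```python
import random

# Uniform sampling ranges copied from the table; key names are illustrative.
SAMPLING_RANGES = {
    "affine_translation_mm": (0.0, 50.0),
    "deformation_voxel_length_mm": (8.0, 16.0),
    "deformation_sd_mm": (0.0, 3.0),
    "label_intensity_mean": (0.0, 1.0),
    "label_intensity_sd": (0.0, 0.1),
    "bias_field_voxel_length_mm": (4.0, 64.0),
    "bias_field_sd": (0.0, 0.5),
    "fov_cropping_mm": (0.0, 50.0),
}

def sample_hyperparameters(rng=random):
    """Draw one uniformly distributed value for every synthesis hyperparameter."""
    return {name: rng.uniform(low, high) for name, (low, high) in SAMPLING_RANGES.items()}
```

Passing a seeded `random.Random` instance as `rng` makes a synthesis run reproducible. Note that the label intensity mean and SD would be drawn once per anatomical label rather than once per image.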
We optimize $g_\theta$ using a loss function that measures the similarity between predicted and target brain masks. Unless otherwise stated, we employ a loss $\mathcal{L}_{\text{dist}}$ that encourages the network to predict a signed distance transform (SDT) $\hat{d}$ representing the minimum distance (in mm) to the skull boundary at each voxel. Distances are positive within the brain and negative outside, facilitating the extraction of a binary brain mask from $\hat{d}$ at test-time by simple thresholding. The training paradigm is outlined in Figure 3. During training, an exact target Euclidean SDT $d$ is computed from the target brain mask $y$, and the similarity between $\hat{d}$ and $d$ is measured by their mean squared difference (MSE). To concentrate optimization gradients to pertinent regions of the image during training, $d$ is banded such that voxel distances do not surpass a discrete threshold $t$, and all voxels that exceed the distance $t$ are down-weighted in the MSE computation by a factor $b$. Therefore,
$$\mathcal{L}_{\text{dist}}(\hat{d}, d) = \frac{1}{|\Omega|} \sum_{p \in \Omega} \begin{cases} \left(\hat{d}(p) - d(p)\right)^2 & \text{if } |d(p)| < t, \\ b \left(\hat{d}(p) - d(p)\right)^2 & \text{otherwise,} \end{cases}$$
where $p$ represents each voxel in the spatial image domain $\Omega$, and the values of $t$ (in mm) and $b$ used in our experiments are determined via a grid search.
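The banded, down-weighted mean squared error described above can be sketched in NumPy as follows. The function names and the example values of the threshold `t` and weight `b` are illustrative assumptions, not the settings used in the experiments, and the sketch operates on precomputed distance maps rather than computing the Euclidean SDT itself.

```python
import numpy as np

def banded_sdt_loss(pred_sdt, target_sdt, t, b):
    """Weighted MSE between a predicted SDT and a target SDT clipped to [-t, t]."""
    pred = np.asarray(pred_sdt, dtype=float)
    # Band the target so that no distance surpasses the threshold t.
    target = np.clip(np.asarray(target_sdt, dtype=float), -t, t)
    # Voxels at or beyond the band limit contribute with weight b instead of 1.
    weights = np.where(np.abs(target) < t, 1.0, b)
    return float(np.mean(weights * (pred - target) ** 2))

def sdt_to_mask(sdt, threshold=0.0):
    """Extract a binary brain mask: distances are positive inside the brain."""
    return np.asarray(sdt) > threshold
```

For instance, `banded_sdt_loss([0, 0, 10], [1, -1, 10], t=4, b=0.5)` clips the last target value to 4, so the three weighted squared errors are 1, 1, and 18, and their mean is 20/3.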
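As a complementary analysis, we compare the distance-based loss $\mathcal{L}_{\text{dist}}$ against a soft Dice loss $\mathcal{L}_{\text{dice}}$ (Dice, 1945; Milletari et al., 2016), which is commonly used to optimize image segmentation models and quantifies volume overlap for pairs of labels. We define the loss as
$$\mathcal{L}_{\text{dice}} = 1 - \left( \frac{\sum_{p} \left(\hat{y}_b \odot y_b\right)}{\sum_{p} \left(\hat{y}_b \oplus y_b\right)} + \frac{\sum_{p} \left(\hat{y}_n \odot y_n\right)}{\sum_{p} \left(\hat{y}_n \oplus y_n\right)} \right),$$
where $\hat{y}_b$ and $y_b$ represent predicted and target brain label maps, $\hat{y}_n$ and $y_n$ represent predicted and target non-brain label maps, and $\odot$ and $\oplus$ represent voxel-wise multiplication and addition, respectively. While $\mathcal{L}_{\text{dist}}$ and $\mathcal{L}_{\text{dice}}$ both result in effective skull-stripping networks, we favor the distance loss due to its smoothing effect on the outline of the predicted brain mask, as demonstrated in Experiment 4.3.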
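A two-label soft Dice loss of the kind described above can be sketched as follows. This is an illustrative NumPy version, not the training code: the function name and the stabilizing constant `eps` are assumptions, and the non-brain maps are taken to be the complements of the brain maps, which holds for a two-class softmax output.

```python
import numpy as np

def soft_dice_loss(pred_brain, target_brain, eps=1e-8):
    """One minus the mean soft Dice over the brain and non-brain labels."""
    pred_brain = np.asarray(pred_brain, dtype=float)
    target_brain = np.asarray(target_brain, dtype=float)
    # Non-brain probability maps are the complements of the brain maps.
    pairs = [(pred_brain, target_brain), (1.0 - pred_brain, 1.0 - target_brain)]
    # Per label, sum(p * y) / sum(p + y) equals half of the usual Dice score
    # 2 * sum(p * y) / sum(p + y), so adding both labels gives their mean Dice.
    score = sum((p * y).sum() / ((p + y).sum() + eps) for p, y in pairs)
    return float(1.0 - score)
```

A perfect prediction yields a loss close to 0, and a prediction that inverts every voxel yields a loss of 1.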
We implement $g_\theta$ using a U-Net convolutional architecture, with down-sampling (encoder) and up-sampling (decoder) components that facilitate the integration of features across large spatial regions. The U-Net comprises seven resolution levels, which each include two convolutional operations with leaky ReLU activations and filter numbers defined in Figure 3. Down-sampling is achieved through max-pooling, and skip-connections are formed by concatenating the outputs of each encoder level with the inputs of the decoder level with corresponding resolution. In models using the distance loss $\mathcal{L}_{\text{dist}}$, one final, single-feature convolutional layer with linear activation outputs the predicted SDT $\hat{d}$. In models optimized with the soft Dice loss $\mathcal{L}_{\text{dice}}$, the final layer is a two-feature convolution, with softmax activation, that outputs a probabilistic segmentation representing non-brain and brain regions.
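One practical consequence of a seven-level encoder is a constraint on the input shape. The sketch below assumes that each of the six transitions between levels is a max-pooling by a factor of two, which is conventional for U-Nets but not stated explicitly in the text; the function names are hypothetical.

```python
def level_shapes(shape, levels=7, factor=2):
    """Spatial shape of the feature maps at every resolution level."""
    shapes = [tuple(shape)]
    for _ in range(levels - 1):
        shapes.append(tuple(n // factor for n in shapes[-1]))
    return shapes

def pad_to_multiple(shape, levels=7, factor=2):
    """Smallest shape >= `shape` that divides evenly through all pooling steps."""
    multiple = factor ** (levels - 1)  # 64 for seven levels and a factor of two
    return tuple(-(-n // multiple) * multiple for n in shape)
```

Under these assumptions, an input of shape (200, 256, 130) would be padded to (256, 256, 192), and the coarsest level would operate on feature maps of shape (4, 4, 3).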
We train SynthStrip using the Adam optimizer (Kingma and Ba, 2014) with a batch size of one. The initial learning rate is reduced by a factor of two after every 20,000 optimization steps without a decrease in validation loss. At test-time, all inputs to the model are internally conformed to 1-mm isotropic voxel size using trilinear interpolation, and intensities are scaled between 0 and 1. The U-Net outputs are resampled such that the final brain mask is computed in the original input space. We implement SynthStrip in Python, using the open-source PyTorch (Paszke et al., 2019) and Neurite (Dalca et al., 2018) libraries, and make our tool and associated code available in the open-source FreeSurfer package (https://w3id.org/synthstrip). All experiments are conducted using Intel Xeon Silver 4214R CPUs and Nvidia RTX 8000 GPUs.
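The learning-rate rule described above (halve the rate after 20,000 optimization steps without a decrease in validation loss) can be sketched with a small framework-independent class. The class name, the method signature, and the starting rate used in the usage note are placeholders, not the values or code used for training.

```python
class PlateauHalving:
    """Halve the learning rate after `patience_steps` steps without validation improvement."""

    def __init__(self, lr, patience_steps=20_000, factor=0.5):
        self.lr = lr
        self.patience_steps = patience_steps
        self.factor = factor
        self.best = float("inf")
        self.steps_since_best = 0

    def update(self, val_loss, steps_elapsed):
        """Report a validation loss measured `steps_elapsed` steps after the previous report."""
        if val_loss < self.best:
            self.best = val_loss
            self.steps_since_best = 0
        else:
            self.steps_since_best += steps_elapsed
            if self.steps_since_best >= self.patience_steps:
                self.lr *= self.factor
                self.steps_since_best = 0
        return self.lr
```

Starting from a placeholder rate of 1.0 and validating every 10,000 steps, two consecutive validations without improvement bring the rate to 0.5. In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.5` offers comparable behaviour, with patience counted in scheduler calls rather than optimization steps.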
In our experiments, we employ a small training dataset of adult and infant brain segmentations and a separate, larger dataset of acquired images for validation and testing that spans across age, health, resolution, and imaging modality. All data are 3D images, acquired either directly or as stacks of 2D MRI slices.
3.4.1 Training Data
Datasets: We compose a set of 80 training subjects, each with whole-head tissue segmentations, from the following three cohorts: 40 adult subjects from the Buckner40 dataset (Fischl and others, 2002), 30 locally scanned adult subjects from the Human Connectome Aging Project (HCP-A) (Bookheimer et al., 2019; Harms and others, 2018), and 10 infant subjects born full-term, scanned at Boston Children’s Hospital at ages between 0 and 18 months (de Macedo Rodrigues et al., 2015).
Processing: To compute anatomical segmentations of individual cerebral regions, adult and infant T1w scans are processed with SAMSEG (Puonti et al., 2016) and the Infant FreeSurfer reconstruction pipeline (Zöllei et al., 2020), respectively. In order to build complete segmentation maps for robust whole-head image synthesis, we also generate six coarse labels of extra-cerebral tissue using a simple intensity-based labeling strategy with thresholds that mark label intensity boundaries. Considering only non-zero voxels without brain labels, we fit threshold values to each image by maximizing the similarity in number of voxels for each extra-cerebral label. These extra-cerebral labels do not necessarily represent or differentiate meaningful anatomical structures – their purpose is to provide intensity and spatial variability to synthesized regions outside the brain.
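One way to read the extra-cerebral labeling strategy is as an equal-count binning of intensities: if every label should cover a similar number of voxels, the thresholds are approximately the intensity quantiles of the non-zero, non-brain voxels. The sketch below follows that interpretation, which is an assumption on our part; the function name and the label numbering (1 to 6) are hypothetical.

```python
import numpy as np

def extracerebral_labels(image, brain_mask, n_labels=6, first_label=1):
    """Assign coarse intensity-based labels to non-zero voxels outside the brain."""
    image = np.asarray(image, dtype=float)
    candidates = (image != 0) & ~np.asarray(brain_mask, dtype=bool)
    values = image[candidates]

    # Interior quantiles split the candidate voxels into bins of similar size.
    quantiles = np.linspace(0.0, 1.0, n_labels + 1)[1:-1]
    thresholds = np.quantile(values, quantiles)

    labels = np.zeros(image.shape, dtype=int)
    # np.digitize returns a bin index in 0..n_labels-1 for every candidate voxel.
    labels[candidates] = first_label + np.digitize(values, thresholds)
    return labels, thresholds
```

Because these labels only need to provide intensity and spatial variability outside the brain, a purely intensity-driven split of this kind does not have to correspond to real anatomical structures.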
In total, the training segmentations contain 46 individual anatomical labels, with 40 brain-specific labels (including CSF), that we merge into the target brain mask $y$. All training segmentations are fit to a common image shape with 1-mm isotropic resolution. We emphasize that this geometric preprocessing is not required at test-time.
3.4.2 Evaluation Data
Datasets: Our evaluation data comprise 620 images, split into validation and test subsets of sizes 22 and 598, respectively. We gather these images across seven public datasets, with makeup, resolution, and validation splits outlined in Table 2. The IXI dataset (acquired from http://brain-development.org/ixi-dataset) features a range of MRI contrasts and modalities, including T1w and T2w as well as PDw, MRA, and DWI. To simplify the DWI evaluation, a single diffusion direction is randomly extracted from each acquisition. The FSM subset (Greve et al., 2021) is derived from in-house data using standard acquisitions as well as quantitative T1 maps (qT1). In-house, pseudo-continuous ASL (PCASL) scans are acquired as stacks of 2D-EPI slices with low resolution and a small FOV that often crops the ventral brain region (Dai et al., 2008). The QIN (Clark et al., 2013; Mamonov and Kalpathy-Cramer, 2016; Prah et al., 2015) dataset comprises pre-contrast, clinical stacks of thick image slices from patients with newly diagnosed glioblastoma. We also include a subset of the infant T1w image dataset, using subjects held-out from training. Lastly, to evaluate the ability of SynthStrip to adapt to imaging modalities beyond MR, we gather a test cohort of brain CT and fluorodeoxyglucose (FDG) PET scans from the CERMEP-IDB-MRXFDG (CIM) database (Mérida et al., 2021).
Ground-truth masks: For each image in the evaluation dataset, we derive a reference brain mask with a pseudo-manual labelling strategy, described as follows. Since every evaluation subject includes a corresponding T1w image, we generate brain masks for these scans using each classical baseline method evaluated in our analysis. Then, an “average” brain mask is computed for each subject by extracting the majority label value at every voxel. We refine the average masks manually before propagating the masks by rigidly aligning each subject’s T1w scan to the remaining image types with a robust registration approach (Reuter et al., 2010). Poor alignments are further refined by hand. We make the reference dataset available online to facilitate future development of skull-stripping techniques, including the original images if permitted by their respective licenses.
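The voxel-wise majority vote used to form the "average" mask can be sketched in a few lines of NumPy. The function name is hypothetical, and resolving ties as non-brain is an assumption that only matters when an even number of masks is combined.

```python
import numpy as np

def majority_mask(masks):
    """Label a voxel as brain when more than half of the input masks agree."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)
    # Strict majority: ties (possible only for an even number of masks) become non-brain.
    return votes * 2 > stack.shape[0]
```

With an odd number of input masks, such as one per classical baseline when five baselines are used, ties cannot occur.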
We analyze the performance of SynthStrip on diverse whole-head images and compare its 3D skull-stripping accuracy to classical and deep-learning baseline tools.
Baselines: We select a group of skull-stripping baselines based on their popularity, determined by citation count, and effectiveness, as shown in prior work (Fatima et al., 2020; Iglesias et al., 2011). As classical baselines, we choose ROBEX 1.1, BET from FSL 6.0.4, 3dSkullStrip (3DSS) from AFNI 21.0.21, BEaST 1.15, and the FreeSurfer 7.2 watershed algorithm (FSW). Unfortunately, many top-cited, learning-based approaches do not make their models available, even upon request to the authors. A notable exception is Deep MRI Brain Extraction (DMBE), which we therefore include. Default parameters are used for each method except BET, for which the -R option is provided for more accurate brain center estimation. All inputs to FSW and DMBE are resampled to 1-mm isotropic voxel sizes to accommodate the expected input resolution for these methods.
We evaluate the similarity between computed and ground-truth brain masks by measuring their Dice overlap, mean and maximum (Hausdorff) surface distances, and percent difference in total volume. Baseline scores are compared to SynthStrip with a paired sample t-test. Sensitivity and specificity, which measure the percent of true positive and true negative brain labels, respectively, provide further insight into the properties of the computed brain masks.
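The overlap-based metrics can be computed directly from the confusion counts between a predicted and a reference mask, as in the sketch below. The function name and dictionary keys are hypothetical, the percent volume difference is taken here as the absolute difference relative to the reference volume (an assumed definition), and the surface-distance metrics are omitted because they require a distance transform.

```python
import numpy as np

def overlap_metrics(pred, truth):
    """Dice, percent volume difference, sensitivity, and specificity for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # brain voxels correctly labeled as brain
    tn = int(np.sum(~pred & ~truth))  # non-brain voxels correctly labeled as non-brain
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    # Assumes both masks contain brain and non-brain voxels (no zero denominators).
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "volume_diff_percent": 100 * abs((tp + fp) - (tp + fn)) / (tp + fn),
        "sensitivity_percent": 100 * tp / (tp + fn),
        "specificity_percent": 100 * tn / (tn + fp),
    }
```

Reporting sensitivity and specificity alongside Dice helps distinguish masks that over-label the brain (low specificity) from masks that under-label it (low sensitivity).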
We assess the broad skull-stripping capability of a SynthStrip model trained using images synthesized from the label maps outlined in Section 3.4.1. We compare the accuracy of our method to each of the baselines across the test set of real brain images defined in Section 3.4.2. Method runtime is compared for the FSM dataset.
For every evaluation metric, brain masks predicted by SynthStrip yield significantly better scores than baseline masks (P < 0.05) for the vast majority of datasets. Importantly, no baseline method significantly outperforms SynthStrip on any dataset. As shown in Figure 4, SynthStrip achieves the highest Dice score and lowest mean surface distance for more than 80% of all test images, in stark contrast to the next best performing method, BET, which yields the top result for less than 10% of images. The superior performance of SynthStrip persists even when considering only T1w, near-isotropic, adult-brain images, which all of the baselines are tuned for. Across this particular subset of 127 T1w images from the IXI, FSM, and ASL datasets, SynthStrip achieves the best mean Dice, surface distance, Hausdorff distance, and volume difference (Figure 5), and it consistently extracts the brain with high specificity and sensitivity, while other methods tend to under-perform in either of those metrics due to tendencies to substantially over- or under-label the brain. When considering the remaining non-T1w and thick-slice image types, SynthStrip's predominance is similarly substantial (Figure 6). For FSM T1w data, our method runs on the CPU in less than one minute (Table 5), trailing the fastest two baselines, BET and FSW, by approximately 17 seconds on average. On the GPU, SynthStrip runs substantially faster, requiring only seconds.
4.2 Variability Across Time-Series Data
We analyze the consistency of SynthStrip brain masks across time-domain data by assessing the differences between diffusion-encoded directions acquired in the same session. For each subject in the DWI dataset, we affinely align and skull-strip all of the 16 diffusion-encoded frames in a common, average space (Reuter et al., 2010). We compute the number of discordant voxels across brain masks for a given method, defining discordant voxels (DV) as voxel locations with labels that differ in the time domain. We report the percent of DV relative to the brain mask volume, determined by the number of voxels labeled as brain in any frame. In this particular analysis, we only consider ROBEX, BET, and 3DSS as baselines since they generalize to DWI acquisitions. As shown in Figure 7, SynthStrip demonstrates a high level of intra-subject consistency, as it predicts brain masks with substantially lower % DV across DWI directions than the baselines (P ). Since the % DV metric considers voxels labeled as brain for any direction, a single mask with gross mislabeling will substantially increase its value, as is the case with ROBEX, which over-segments the brain for only a few directions per subject.
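The discordant-voxel measure can be expressed compactly: a voxel is discordant when it is labeled as brain in at least one frame but not in all frames, and the count is normalized by the number of voxels labeled as brain in any frame. The NumPy sketch below follows that reading of the definition; the function name is hypothetical, and the frames are assumed to be already aligned in a common space.

```python
import numpy as np

def percent_discordant_voxels(frame_masks):
    """Percent of voxels whose brain label differs across frames, relative to the union mask."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in frame_masks])
    any_brain = stack.any(axis=0)  # labeled as brain in at least one frame
    all_brain = stack.all(axis=0)  # labeled as brain in every frame
    discordant = any_brain & ~all_brain
    return 100.0 * discordant.sum() / any_brain.sum()
```

Because the denominator is the union over frames, a single grossly over-segmented frame inflates both the union and the discordant count, which is consistent with the sensitivity to outlier masks noted above.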
4.3 Loss Comparison
During our experimentation, we find that training SynthStrip models using a traditional soft Dice loss ($\mathcal{L}_{\text{dice}}$) yields comparable results to those trained with an SDT-based loss ($\mathcal{L}_{\text{dist}}$) for nearly every metric. However, despite similar global accuracy, we observe that models trained with $\mathcal{L}_{\text{dice}}$ predict brain masks characterized by relatively noisy and rough boundaries, as illustrated in Figure 7. The high variability at the edge of the brain mask is emphasized by an increase in maximum surface distance when using $\mathcal{L}_{\text{dice}}$ compared to $\mathcal{L}_{\text{dist}}$. We further quantify this discrepancy in brain mask smoothness by computing the percent of exposed boundary voxels (EBV) that neighbor more non-brain labels than brain labels. Brain masks with noisier boundaries will exhibit larger EBV due to increased mask surface area and number of sporadic border voxels. We perform this evaluation using the FSM data subset of 132 images with isotropic voxel size. Models trained with $\mathcal{L}_{\text{dice}}$ predict masks with higher EBV than models trained with $\mathcal{L}_{\text{dist}}$. We hypothesize that as the network learns to estimate an SDT, it is encouraged to focus more on the boundary of the mask, rather than the label as a whole, resulting in a smoother prediction of the brain border.
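A boundary-roughness measure in the spirit of EBV can be sketched as follows. The exact neighborhood and normalization are not specified in the text, so this version makes explicit assumptions: it uses face-connected neighbors (six in 3D), treats voxels outside the image as non-brain, defines boundary voxels as brain voxels with at least one non-brain neighbor, and reports the percent of those boundary voxels that have more non-brain than brain neighbors. The function name is hypothetical.

```python
import numpy as np

def exposed_boundary_voxels(mask):
    """Percent of boundary voxels with more non-brain than brain face-neighbors."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False).astype(int)
    interior = (slice(1, -1),) * mask.ndim

    # Count brain-labeled face-neighbors of every voxel.
    brain_neighbors = np.zeros(mask.shape, dtype=int)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            brain_neighbors += np.roll(padded, shift, axis=axis)[interior]

    n_neighbors = 2 * mask.ndim
    boundary = mask & (brain_neighbors < n_neighbors)
    exposed = mask & (n_neighbors - brain_neighbors > brain_neighbors)
    return 100.0 * exposed.sum() / boundary.sum()
```

Under these assumptions, a solid cube has no exposed voxels, whereas a single voxel protruding from one of its faces counts as exposed, so rougher outlines produce larger values.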
We present SynthStrip, a learning-based, universal brain extraction tool trained on diverse synthetic images. Subjected to training data that far exceeds the realistic range of medical images, the model learns to generalize across imaging modalities, anatomical variability, and acquisition schemes.
5.1 Baseline Comparison
SynthStrip significantly improves upon baseline skull-stripping accuracy for nearly every image cohort tested, and the few exceptions to this improvement involve data subsets for which SynthStrip matches baseline performance. This predominance is in part due to the ability of SynthStrip to generalize across a wide variety of image types as well as its proclivity to avoid substantial mislabeling. By learning robust, large-scale spatial features of representative brain masks, the model consistently predicts masks of realistic and expected shape. Baseline techniques, on the other hand, often rely on weak spatial priors and are therefore prone to over- or under-segment brain tissue when confronted with image features that are unexpected or unaccounted for (Figure 8).
5.2 Common Failure Modes
The top performing baseline method is ROBEX, which yields high-quality brain extraction across many of the test datasets, with the notable exception of the qT1 cohort. By exploiting a generative point distribution model, ROBEX produces spatially plausible brain masks and evades drastic failure modes that exist in other baselines, similarly to SynthStrip. However, despite its generally good performance, ROBEX has a tendency to include pockets of tissue surrounding the eyes and remove regions of cortical gray matter near the superior surface. Interestingly, ROBEX’s consistent performance across contrasts and modality is somewhat unexpected since the discriminative edge detector is trained only for T1w scans. We hypothesize that the coupled shape model is able to compensate for any intensity bias encoded in the discriminative detector.
BET and 3DSS also perform effective brain extraction across image types, but tend to fail dramatically for outlier cases. For example, BET locates the brain boundary with considerable precision when successful, but for some image subsets, especially those with abundant non-brain matter, such as FSM, BET often includes large regions of inferior skull as well as facial and neck tissue in the brain mask. While 3DSS better avoids such gross mislabeling, it tends to produce skull-strips that leak into neck tissue or, conversely, remove small regions of the cortical surface.
BEaST and FSW perform decently for near-isotropic T1w images, such as those in the IXI, FSM, and ASL datasets, but since they are heavily optimized for the assumed spatial and intensity features of this acquisition type, they generally perform poorly, or even fail completely, for other contrasts. In theory, these two approaches could be effective for other contrasts if they are provided known intensity priors of the brain matter, but this work would need to be done for every new image type. Common error modes of FSW involve the failure to remove bits of skull or inferior non-brain matter, in contrast to BEaST, which is susceptible to removing critical regions of the cortex.
The only learning-based comparison available to us, DMBE, yields suitable brain masks, in particular for near-isotropic image types with T1w contrast, but usually leaves substantial, unconnected components or “islands” of non-brain matter. This is likely a byproduct of its convolution architecture, which does not leverage multiple resolution levels to gather spatial features across large distances. DMBE model inference is slow, consuming more than a half hour to skull-strip a standard image.
5.3 Variability Across Time-Series Data
Rigid registration of individual frames acquired across time for a subject is a common motion-correction procedure applied retrospectively before a diffusion analysis (Holdsworth et al., 2012; Jones and Leemans, 2011) or other time-series analyses, such as those performed for functional MRI (Ashburner, 2009; Jenkinson et al., 2002). However, anatomical structures in the image that deform non-rigidly can hamper such registration and have the potential to impinge on downstream results if left unaccounted for. To increase accuracy, some studies therefore remove non-brain tissue from each frame prior to registration (Andrade et al., 2018; Fischmeister et al., 2013). This procedure requires consistent brain extraction across frames, as the registration algorithm would otherwise attempt to match non-corresponding structures (Andrade et al., 2018; Fein et al., 2006; Fischmeister et al., 2013; Hoffmann et al., 2020), for example if dura is present in one of the frames but not the others. In Section 4.2, SynthStrip exhibits the lowest variability in brain masks across the frames of each subject despite substantial differences in image contrast due to the diffusion encoding, demonstrating its usefulness for guiding brain-specific registration.
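As a rough illustration of what consistency across frames means, the sketch below scores a stack of per-frame brain masks by the fraction of voxels in their union that are not labeled as brain in every frame. This is an assumed, simplified measure and not necessarily the metric used in Section 4.2. It also assumes the frames are already spatially aligned, and the name `mask_disagreement` is hypothetical.

```python
import numpy as np


def mask_disagreement(masks):
    """Fraction of union voxels that are not brain in all frames.

    `masks` has shape (frames, x, y, z). Returns 0.0 for perfectly
    consistent masks and approaches 1.0 as the frames disagree.
    Hypothetical helper for illustration only.
    """
    masks = np.asarray(masks, dtype=bool)
    union = masks.any(axis=0)
    intersection = masks.all(axis=0)
    n_union = union.sum()
    if n_union == 0:
        return 0.0
    return float(1.0 - intersection.sum() / n_union)


# Three frames sharing a 64-voxel brain block; the third frame also
# includes a 16-voxel slab, standing in for dura kept in only one frame.
frames = np.zeros((3, 6, 6, 6), dtype=bool)
frames[:, 1:5, 1:5, 1:5] = True
frames[2, 5, 1:5, 1:5] = True

score = mask_disagreement(frames)
print(round(score, 3))
```

Here the union covers 80 voxels and the intersection 64, so the score is 0.2. A lower score indicates that the registration algorithm would see more closely corresponding structures in each frame.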
5.4 Model and Data Availability
Even as learning-based methods in neuroimaging analysis continue to grow in popularity, developers of deep-learning skull-stripping tools are sometimes disinclined to provide easy-to-use distributions of their work. Out of the three promising methods discussed in this work, only DMBE makes its models and code publicly available for use. In contrast, we make SynthStrip available as a universal, cross-platform command-line utility, distributed both as a standalone and as a built-in FreeSurfer tool. To facilitate further development and testing of robust skull-stripping tools, we also make our evaluation data and ground-truth labels available at https://w3id.org/synthstrip.
5.5 Future Work
While SynthStrip facilitates state-of-the-art brain extraction, we aim to extend the tissue-extraction strategy to other applications both within and beyond neuroimaging. One such application is fetal head extraction from in-utero fetal MRI scans. Due to excessive motion, fetal MRI is limited to the acquisition of sub-second 2D slices. However, stacks of several slices are needed to cover the anatomy of interest, and while their in-plane resolution is typically of the order of 1 mm, views across slices are hampered by slice thicknesses of 4–6 mm and between-slice motion (Hoffmann et al., 2021a). To enable full 3D views of the fetal brain, post-processing tools for super-resolution reconstruction have emerged that aim to reconstruct a high-quality volume of isotropic resolution from a number of slice stacks acquired at different angles (Rousseau et al., 2006; Kainz et al., 2015; Ebner et al., 2020; Iglesias et al., 2021). Yet, these methods hinge on successful brain extraction, which is challenging due to frequent artifacts and because the relatively small brain first needs to be localized within a wide FOV encompassing the maternal anatomy (Gaudfernau et al., 2021). In addition, substantially fewer public fetal datasets are available for training in comparison to vast public adult brain datasets. This presents an ideal problem to be addressed with SynthStrip, as our approach synthesizes an endless stream of training data from only a handful of label maps.
The removal of non-brain signal from neuroimaging data is a fundamental first step for many quantitative analyses and its accuracy has a direct impact on downstream results. However, popular skull-stripping utilities are typically tailored to isotropic T1w scans and tend to fail, sometimes catastrophically, on images with other MRI contrasts or stack-of-slices acquisitions that are common in the clinic. We propose SynthStrip, a flexible tool that produces highly accurate brain masks across a landscape of imaging paradigms with widely varying contrast and resolution. We implement our method by leveraging anatomical label maps to synthesize a broad set of training images, optimizing a robust convolutional neural network that is agnostic to MRI contrasts and acquisition schemes.
The authors thank Douglas Greve, Lilla Zöllei, and David Salat for sharing FSM, infant, and ASL data. Support for this research was provided in part by the BRAIN Initiative Cell Census Network grant U01 MH117023, the National Institute for Biomedical Imaging and Bioengineering (P41 EB015896, R01 EB023281, R21 EB018907, R01 EB019956, P41 EB030006), the National Institute of Child Health and Human Development (K99 HD101553), the National Institute on Aging (R56 AG064027, R01 AG016495, R01 AG070988), the National Institute of Mental Health (RF1 MH121885, RF1 MH123195), the National Institute for Neurological Disorders and Stroke (R01 NS070963, R01 NS083534, R01 NS105820), and was made possible by the resources provided by Shared Instrumentation Grants S10 RR023401, S10 RR019307, and S10 RR023043. Additional support was provided by the NIH Blueprint for Neuroscience Research (U01 MH093765), part of the multi-institutional Human Connectome Project. Bruce Fischl has a financial interest in CorticoMetrics, a company whose medical pursuits focus on brain imaging and measurement technologies. These interests are reviewed and managed by Massachusetts General Hospital and Mass General Brigham in accordance with their conflict of interest policies.
- A practical review on medical image registration: from rigid to deep learning based approaches. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 463–470. Cited by: §1, §5.3.
- A log-euclidean framework for statistics on diffeomorphisms. In MICCAI: Medical Image Computing and Computer Assisted Interventions, pp. 924–31. Cited by: §3.1.
- A fast diffeomorphic image registration algorithm. Neuroimage 38 (1), pp. 95–113. Cited by: §1.
- Preparing fmri data for statistical analysis. In fMRI techniques and protocols, pp. 151–178. Cited by: §5.3.
- Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26–41. Cited by: §1.
- A reproducible evaluation of ants similarity metric performance in brain image registration. Neuroimage 54 (3), pp. 2033–2044. Cited by: §1.
- A learning strategy for contrast-agnostic mri segmentation. arXiv preprint arXiv:2003.01995. Cited by: §1, §3.1.
- The lifespan human connectome project in aging: an overview. NeuroImage 185, pp. 335–348. Cited by: §3.4.1.
- Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §2.1.
- The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging 26 (6), pp. 1045–1057. Cited by: §3.4.2.
- Nonlocal patch-based label fusion for hippocampus segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 129–136. Cited by: §2.1.
- Patch-based segmentation using expert priors: application to hippocampus and ventricle segmentation. NeuroImage 54 (2), pp. 940–954. Cited by: §2.1.
- Real-time 3d image registration for functional mri. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 42 (6), pp. 1014–1018. Cited by: §1.
- AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical research 29 (3), pp. 162–173. Cited by: §1, §2.1.
- Continuous flow-driven inversion for arterial spin labeling using pulsed radio frequency and gradient fields. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 60 (6), pp. 1488–1497. Cited by: §3.4.2.
- Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In , pp. 9290–9299. Cited by: §3.3.
- Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med.I.A. 57, pp. 226–236. Cited by: §3.1.
- A freesurfer-compliant consistent manual segmentation of infant brains spanning the 0–2 year age range. Frontiers in human neuroscience 9, pp. 21. Cited by: §3.4.1.
- Measures of the amount of ecologic association between species. Ecology 26 (3), pp. 297–302. Cited by: §3.2.
- An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain mri. NeuroImage 206, pp. 116324. Cited by: §5.5.
- BEaST: brain extraction based on nonlocal segmentation technique. NeuroImage 59 (3), pp. 2362–2373. Cited by: §2.1.
- State-of-the-art traditional to the machine-and deep-learning-based skull stripping techniques, models, and algorithms. Journal of Digital Imaging 33 (6), pp. 1443–1464. Cited by: §2, §4.
- Statistical parametric mapping of brain morphology: sensitivity is dramatically increased by using brain-extracted images as inputs. Neuroimage 30 (4), pp. 1187–1195. Cited by: §1, §5.3.
- Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33 (3), pp. 341–355. Cited by: §3.4.1.
- FreeSurfer. NeuroImage 62 (2), pp. 774–81. Note: 20 Years of fMRI. Cited by: §1.
- The benefits of skull stripping in the normalization of clinical fmri data. NeuroImage: Clinical 3, pp. 369–380. Cited by: §1, §5.3.
- Spatial registration and normalization of images. Human brain mapping 3 (3), pp. 165–189. Cited by: §1.
- Analysis of the anatomical variability of fetal brains with corpus callosum agenesis. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis, pp. 274–283. Cited by: §5.5.
- Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society: Series B (Methodological) 51 (2), pp. 271–279. Cited by: §2.1.
- A deep learning toolbox for automatic segmentation of subcortical limbic structures from mri images. Neuroimage 244, pp. 118610. Cited by: §3.4.2.
- Extending the human connectome project across ages: imaging protocols for the lifespan development and aging projects. NeuroImage 183, pp. 972–84. Cited by: §3.4.1.
- The many faces of robustness: a critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8340–8349. Cited by: §1.
- Rapid head-pose detection for automated slice prescription of fetal-brain mri. International journal of imaging systems and technology 31 (3), pp. 1136–1154. Cited by: §5.5.
- Learning mri contrast-agnostic registration. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 899–903. Cited by: §1, §1, §3.1.
- A survey of patient motion in disorders of consciousness and optimization of its retrospective correction. Magnetic resonance imaging 33 (3), pp. 346–350. Cited by: §1.
- Real-time brain masking algorithm improves motion tracking accuracy in scans with volumetric navigators (vNavs). In International Society for Magnetic Resonance in Medicine, pp. 3367. Cited by: §1, §5.3.
- Diffusion tensor imaging (dti) with retrospective motion correction for large-scale pediatric imaging. Journal of Magnetic Resonance Imaging 36 (4), pp. 961–971. Cited by: §5.3.
- 3D u-net for skull stripping in brain mri. Applied Sciences 9 (3), pp. 569. Cited by: §1, §2.2.
- Joint super-resolution and synthesis of 1 mm isotropic mp-rage volumes from clinical mri exams with scans of different orientation, resolution and contrast. NeuroImage 237, pp. 118206. Cited by: §5.5.
- Robust brain extraction across datasets and comparison with publicly available methods. IEEE transactions on medical imaging 30 (9), pp. 1617–1634. Cited by: §1, §2.1, §4.
- Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage 17 (2), pp. 825–841. Cited by: §5.3.
- Fsl. Neuroimage 62 (2), pp. 782–790. Cited by: §1, §2.1.
- A global optimisation method for robust affine registration of brain images. Medical image analysis 5 (2), pp. 143–156. Cited by: §1.
- Motion detection and correction in functional mr imaging. Human Brain Mapping 3 (3), pp. 224–235. Cited by: §1.
- PSACNN: pulse sequence adaptive fast whole brain segmentation. NeuroImage 199, pp. 553–569. Cited by: §1.
- Diffusion tensor imaging. In Magnetic resonance neuroimaging, pp. 127–144. Cited by: §5.3.
- Fast volume reconstruction from motion corrupted stacks of 2d slices. IEEE transactions on medical imaging 34 (9), pp. 1901–1913. Cited by: §5.5.
- A lifelong learning approach to brain mr segmentation across scanners and protocols. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Cham, pp. 476–484. Cited by: §1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
- Deep mri brain extraction: a 3d convolutional neural network for skull stripping. NeuroImage 129, pp. 460–469. Cited by: §1, §2.2.
- Evaluation of 14 nonlinear deformation algorithms applied to human brain mri registration. Neuroimage 46 (3), pp. 786–802. Cited by: §1.
- Data from qin gbm treatment response. The Cancer Imaging Archive. Cited by: §3.4.2.
- MP2RAGE, a self bias-field corrected sequence for improved segmentation and T1-mapping at high field. NeuroImage 49 (2), pp. 1271–81. Cited by: §1.
- CERMEP-idb-mrxfdg: a database of 37 normal adult human brain [18f] fdg pet, t1 and flair mri, and ct images available for research. EJNMMI research 11 (1), pp. 1–10. Cited by: §3.4.2.
- V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–71. Cited by: §3.2.
- Global image registration using a symmetric block-matching approach. Journal of medical imaging 1 (2), pp. 024003. Cited by: §1.
- Three-dimensional magnetization-prepared rapid gradient-echo imaging (3D MP RAGE). Magnetic Resonance in Medicine 15 (1), pp. 152–157. Cited by: §1.
- Comparative evaluation of registration algorithms in different brain databases with varying difficulty: results and insights. IEEE transactions on medical imaging 33 (10), pp. 2039–2065. Cited by: §1.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §3.3.
- Repeatability of standardized and normalized relative cbv in patients with newly diagnosed glioblastoma. American Journal of Neuroradiology 36 (9), pp. 1654–1661. Cited by: §3.4.2.
- Fast and sequence-adaptive whole-brain segmentation using parametric Bayesian modeling. NeuroImage 143, pp. 235–49. Cited by: §3.4.1.
- Highly accurate inverse consistent registration: A robust approach. NeuroImage 53 (4), pp. 1181–96. Cited by: §1, §1, §3.4.2, §4.2.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.2.
- Registration-based approach for reconstruction of high-resolution in utero fetal mr brain images. Academic radiology 13 (9), pp. 1072–1081. Cited by: §5.5.
- Nonrigid registration using free-form deformations: application to breast mr images. IEEE transactions on medical imaging 18 (8), pp. 712–721. Cited by: §1.
- Skull stripping using graph cuts. NeuroImage 49 (1), pp. 225–239. Cited by: §2.1.
- Auto-context convolutional neural network (auto-net) for brain extraction in magnetic resonance imaging. IEEE transactions on medical imaging 36 (11), pp. 2319–2330. Cited by: §1, §2.2.
- A hybrid approach to the skull stripping problem in mri. Neuroimage 22 (3), pp. 1060–1075. Cited by: §2.1.
- Magnetic resonance image tissue classification using a partial volume model. NeuroImage 13 (5), pp. 856–876. Cited by: §2.1.
- Fast robust automated brain extraction. Human brain mapping 17 (3), pp. 143–155. Cited by: §2.1.
- Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE transactions on pattern analysis and machine intelligence 32 (10), pp. 1744–1757. Cited by: §2.2.
- Brain morphometry with multiecho MPRAGE. Neuroimage 40 (2), pp. 559–569. Cited by: §1.
- Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45 (1), pp. S61–S72. Cited by: §1.
- Infant freesurfer: an automated segmentation and surface extraction pipeline for t1-weighted neuroimaging data of infants 0–2 years. Neuroimage 218, pp. 116946. Cited by: §3.4.1.