1 Introduction
Multiple sclerosis (MS) is the most frequent chronic inflammatory autoimmune disorder of the central nervous system, causing progressive damage and disability. The disease affects nearly half a million Americans and 2.5 million individuals worldwide (rosati2001prevalence; goldenberg2012multiple), generating more than $10 billion in annual healthcare spending in the United States alone (adelman2013cost).
The ability to diagnose MS and track its progression has been greatly enhanced by magnetic resonance imaging (MRI), which can detect characteristic brain lesions in white and gray matter (bakshi2008mri; lovblad2010mr; blystad2015quantitative; GarciaLorenzo2013). Lesion activity visualized by MRI can be up to an order of magnitude more sensitive as a marker of disease activity than clinical assessment (filippi2006efns). The prevalence and dynamics of white matter lesions are thus used clinically to diagnose MS (Thompson2018), define disease stages, and determine the efficacy of therapeutic regimens (Sormani). MRI is also an unparalleled tool for characterizing brain atrophy, which occurs at a faster rate in patients with MS compared to healthy controls (Barkhof2009; Azevedo2018) and, especially in deep gray matter structures and the cerebral cortex, has been shown to correlate with measures of disability (geurts2012measurement).
Although manual labeling remains the most accurate way of delineating white matter lesions in MS (Commowick2018), this approach is very cumbersome and in itself prone to considerable intra- and inter-rater disagreement (Zijdenbos_MICCAI). Furthermore, manually labeling various normal-appearing brain structures to assess atrophy is simply too time-consuming to be practically feasible. Therefore, there is a clear need for automated tools that can reliably and efficiently characterize the morphometry of white matter lesions, various neuroanatomical structures, and their changes over time directly from in vivo MRI. Such tools are of great potential value for diagnosing disease, tracking progression, and evaluating treatment. They can also help in obtaining a better understanding of underlying disease mechanisms, and facilitate more efficient testing in clinical trials. Ultimately, automated software tools may help clinicians to prospectively identify which patients are at highest risk of future disability accrual, leading to better counseling of patients and better overall clinical outcomes.
Despite decades of methodological development (cf. (GarciaLorenzo2013) or (Danelakis2018)), currently available computational tools for analyzing MRI scans of MS patients remain limited in a number of important ways:
 

Poor generalizability: Existing tools are often developed and tested on very specific imaging protocols, and may not be able to work on data that is acquired differently. Especially with the strong surge of supervised learning in recent years, where the relationship between image appearance and segmentation labels in training scans is directly and statically encoded, the segmentation performance of many state-of-the-art algorithms degrades substantially when applied to data from different scanners and acquisition protocols (GarciaLorenzo2013; Valverde2019), severely limiting their usefulness in practice.

Dearth of available software: Despite the very large number of proposed methods, most algorithms are only developed and tested in-house, and very few tools are made publicly available (Shiee2010; Schmidt2012; Bianca; Valverde2017). In order to ensure that computational methods make a real practical impact, they must be accompanied by software implementations that work robustly across a wide array of image acquisitions; that are made publicly available; and that are open-source, rigorously tested and comprehensively documented.
 

Limitations in assessing atrophy: There is a lack of dedicated tools for characterizing brain atrophy patterns in MS: many existing methods characterize only aggregate measures such as global brain or gray matter volume (smith2002accurate; smeets2016reliable) rather than individual brain structures, or require that lesions are pre-segmented so that their MRI intensities can be replaced with placeholder values to avoid biased atrophy measures (sdika2009nonrigid; chard2010reducing; battaglini2012evaluating; gelineau2012effect; ceccarelli2012impact; Azevedo2018) (so-called "lesion filling").
In order to address these limitations, we describe a new open-source software tool for simultaneously segmenting white matter lesions and 41 neuroanatomical structures from MRI scans of MS patients. An example segmentation produced by this tool is shown in Fig. 1. By performing lesion segmentation in the full context of whole-brain modeling, the method obviates the need to segment lesions and assess atrophy in two separate processing phases, as currently required in lesion-filling approaches. The method works robustly across a wide range of imaging hardware and protocols by completely decoupling computational models of anatomy from models of the imaging process, thereby sidestepping the intrinsic generalization difficulties of supervised methods such as convolutional neural networks. Our software implementation is freely available as part of the FreeSurfer neuroimaging analysis package
(FreeSurfer).

To the best of our knowledge, only two other methods have been developed for joint whole-brain and white matter lesion segmentation in MS. (Shiee2010) model lesions as an extra tissue class in an unsupervised whole-brain segmentation method (Bazin), removing false positive detections of lesions using a combination of topological constraints and hand-crafted rules implementing various intensity- and distance-based heuristics. However, the method segments only a small set of neuroanatomical structures (10), and validation of this aspect was limited to a simulated MRI scan of a single subject. (Mckinley2019) use a cascade of two convolutional neural networks, with the first one skull-stripping individual image modalities and the second one generating the actual segmentation. However, the whole-brain segmentation performance of this method was only evaluated on a few structures (7). Furthermore, as a supervised method, its applicability to data that differs substantially from its training data will necessarily be limited.

A preliminary version of this work was presented in (Puonti2016a). Compared to this earlier work, the current article employs more advanced models for the shape and appearance of white matter lesions, and includes a more thorough validation of the segmentation performance of the proposed method, including an evaluation of the whole-brain segmentation component and comparisons with human inter-rater variability.
2 Contrastadaptive wholebrain segmentation
We build upon a method for wholebrain segmentation called Sequence Adaptive Multimodal SEGmentation (SAMSEG) that we previously developed (Puonti2016), and that we propose to extend with the capability to handle white matter lesions. SAMSEG robustly segments 41 structures from head MRI scans without any form of preprocessing or prior assumptions on the scanning platform or the number and type of pulse sequences used. Since we build heavily on this method for the remainder of the paper, we briefly outline its main characteristics here.
SAMSEG is based on a generative approach, in which a forward probabilistic model is inverted to obtain automated segmentations. Let $\mathbf{D} = (\mathbf{d}_1, \ldots, \mathbf{d}_I)$ denote a matrix collecting the intensities in a multi-contrast brain MR scan with $I$ voxels, where the vector $\mathbf{d}_i$ contains the intensities in voxel $i$ for each of the available contrasts. Furthermore, let $\mathbf{l} = (l_1, \ldots, l_I)$ be the corresponding labels, where $l_i \in \{1, \ldots, K\}$ denotes one of the $K$ possible segmentation labels assigned to voxel $i$. SAMSEG estimates a segmentation $\hat{\mathbf{l}}$ from MRI data $\mathbf{D}$ by using a generative model, illustrated in black in Fig. 2. According to this model, $\mathbf{l}$ is sampled from a segmentation prior $p(\mathbf{l} \mid \boldsymbol{\theta}_l)$, after which $\mathbf{D}$ is obtained by sampling from a likelihood function $p(\mathbf{D} \mid \mathbf{l}, \boldsymbol{\theta}_d)$, where $\boldsymbol{\theta}_l$ and $\boldsymbol{\theta}_d$ are model parameters with priors $p(\boldsymbol{\theta}_l)$ and $p(\boldsymbol{\theta}_d)$. Segmentation then consists of inferring the unknown $\mathbf{l}$ from the observed $\mathbf{D}$ under this model. In the following, we summarize the segmentation prior and the likelihood used in SAMSEG, as well as the way the resulting model is used to obtain automated segmentations.

2.1 Segmentation prior
To model the spatial configuration of various neuroanatomical structures, we use a deformable probabilistic atlas as detailed in (Puonti2016). In short, the atlas is based on a tetrahedral mesh, where the parameters $\boldsymbol{\theta}_l$ are the spatial positions of the mesh's vertices, and $p(\boldsymbol{\theta}_l)$ is a topology-preserving deformation prior that prevents the mesh from tearing or folding (Ashburner2000). The model assumes conditional independence of the labels between voxels for a given deformation:

$p(\mathbf{l} \mid \boldsymbol{\theta}_l) = \prod_{i=1}^{I} p(l_i \mid \boldsymbol{\theta}_l),$

and computes the probability of observing label $k$ at voxel $i$ as

$p(l_i = k \mid \boldsymbol{\theta}_l) = \sum_{j=1}^{J} \alpha_j^k \, \psi_j(\boldsymbol{\theta}_l), \quad (1)$

where $\alpha_j^k$ are label probabilities defined at the $J$ vertices of the mesh, and $\psi_j(\cdot)$ denotes a spatially compact, piecewise-linear interpolation basis function attached to the $j$-th vertex and evaluated at the $i$-th voxel (VanLeemput2009).

The topology of the mesh, the mode of the deformation prior $p(\boldsymbol{\theta}_l)$, and the label probabilities $\alpha_j^k$ can be learned automatically from a set of segmentations provided as training data (VanLeemput2009). This involves an iterative process that combines a mesh simplification operation with a group-wise nonrigid registration step to warp the atlas to each of the training subjects, and an Expectation-Maximization (EM) algorithm (Dempster1977JRSS) to estimate the label probabilities in the mesh vertices. The result is a sparse mesh that encodes high-dimensional atlas deformations through a compact set of vertex displacements. As described in (Puonti2016), the atlas used in SAMSEG was derived from manual whole-brain segmentations of 20 subjects, representing a mix of healthy individuals and subjects with questionable or probable Alzheimer's disease.

2.2 Likelihood function
For the likelihood function we use a Gaussian model for each of the different structures. We assume that the bias field artifact can be modelled as a multiplicative and spatially smooth effect (Wells1996). For computational reasons, we use log-transformed image intensities in $\mathbf{D}$, and model the bias field as a linear combination of spatially smooth basis functions that is added to the local voxel intensities (VanLeemput1999). Letting $\boldsymbol{\theta}_d$ collect all bias field parameters and Gaussian means and variances, the likelihood is defined as

$p(\mathbf{D} \mid \mathbf{l}, \boldsymbol{\theta}_d) = \prod_{i=1}^{I} p(\mathbf{d}_i \mid l_i, \boldsymbol{\theta}_d), \qquad p(\mathbf{d}_i \mid l_i = k, \boldsymbol{\theta}_d) = \mathcal{N}\!\left( \mathbf{d}_i \,\middle|\, \boldsymbol{\mu}_k + \mathbf{b}_i, \, \boldsymbol{\Sigma}_k \right), \qquad b_i^n = \sum_{p=1}^{P} c_p^n \, \phi_p^i,$

where $P$ denotes the number of bias field basis functions, $\phi_p^i$ is the $p$-th basis function evaluated at voxel $i$, and $\mathbf{c}^n = (c_1^n, \ldots, c_P^n)^\top$ holds the bias field coefficients for MRI contrast $n$. We use a flat prior for the parameters of the likelihood: $p(\boldsymbol{\theta}_d) \propto 1$.
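To make the additive bias model concrete, the following minimal Python sketch (using made-up values, and low-order polynomial basis functions as a stand-in for the smooth basis actually used) shows that a bias field that is additive in log-space corresponds to a multiplicative artifact in the original intensity domain:

```python
import numpy as np

# Toy 1-D illustration of the bias field model: log-transformed
# intensities plus a linear combination of spatially smooth basis
# functions. The basis functions used in SAMSEG differ; low-order
# polynomials serve here purely as an illustrative stand-in.
def bias_field(coords, coeffs):
    """Evaluate sum_p c_p * phi_p(x) with polynomial basis phi_p(x) = x**p."""
    basis = np.stack([coords**p for p in range(len(coeffs))], axis=0)
    return coeffs @ basis

coords = np.linspace(-1.0, 1.0, 5)           # voxel positions (normalized)
true_intensities = np.full(5, 100.0)         # ideal, bias-free intensities
log_d = np.log(true_intensities)             # log-transform of the data
c = np.array([0.1, 0.05, -0.02])             # bias field coefficients (made up)
biased_log_d = log_d + bias_field(coords, c) # bias is additive in log-space

# In the original (non-log) domain the same bias acts multiplicatively:
biased_d = np.exp(biased_log_d)
print(np.allclose(biased_d, true_intensities * np.exp(bias_field(coords, c))))  # True
```

This is why the multiplicative smooth artifact of (Wells1996) becomes a simple additive term once intensities are log-transformed.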
2.3 Segmentation
For a given MRI scan $\mathbf{D}$, segmentation proceeds by computing a point estimate $\hat{\boldsymbol{\theta}} = (\hat{\boldsymbol{\theta}}_l, \hat{\boldsymbol{\theta}}_d)$ of the unknown model parameters:

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta} \mid \mathbf{D}),$

which effectively fits the model to the data. Details of this procedure are given in Appendix A. Once $\hat{\boldsymbol{\theta}}$ is found, the corresponding maximum a posteriori (MAP) segmentation

$\hat{\mathbf{l}} = \arg\max_{\mathbf{l}} \, p(\mathbf{l} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$

is obtained by assigning each voxel to the label with the highest probability, i.e., $\hat{l}_i = \arg\max_k p(l_i = k \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$, where

$p(l_i = k \mid \mathbf{D}, \hat{\boldsymbol{\theta}}) = \frac{ p(\mathbf{d}_i \mid l_i = k, \hat{\boldsymbol{\theta}}_d) \, p(l_i = k \mid \hat{\boldsymbol{\theta}}_l) }{ \sum_{k'=1}^{K} p(\mathbf{d}_i \mid l_i = k', \hat{\boldsymbol{\theta}}_d) \, p(l_i = k' \mid \hat{\boldsymbol{\theta}}_l) } \quad (2)$

are probabilistic label assignments evaluated at the estimated parameters $\hat{\boldsymbol{\theta}}$. It is worth emphasizing that, since the class means and variances are estimated from each target scan individually, the model automatically adapts to each scan's specific intensity characteristics – a property that we demonstrated experimentally on several data sets acquired with different imaging protocols, scanners and field strengths in (Puonti2016).
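The per-voxel label posterior of Eq. (2) is simply a normalized product of the Gaussian likelihood and the atlas prior. A minimal single-contrast Python sketch, with made-up class parameters and prior probabilities, could look like:

```python
import numpy as np

def gaussian(d, mu, var):
    """Univariate Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * (d - mu)**2 / var) / np.sqrt(2.0 * np.pi * var)

# Single-contrast toy example with K = 3 classes (e.g., CSF, GM, WM).
# Atlas prior probabilities and estimated Gaussian parameters are made up.
prior = np.array([0.1, 0.3, 0.6])      # p(l_i = k | theta_l_hat) from the atlas
mu    = np.array([30.0, 60.0, 90.0])   # estimated class means
var   = np.array([100.0, 100.0, 100.0])# estimated class variances
d_i   = 85.0                           # observed voxel intensity

lik = gaussian(d_i, mu, var)           # p(d_i | l_i = k, theta_d_hat)
posterior = lik * prior / np.sum(lik * prior)   # Eq. (2), normalized over k
label = int(np.argmax(posterior))      # MAP label for this voxel
print(label)  # 2 (the WM-like class)
```

In the actual method the same computation runs over all voxels and all 41 structures, with multivariate Gaussians and bias-field-corrected intensities.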
Our implementation of this method, written in Python with the exception of C++ parts for the computationally demanding optimization of the atlas mesh deformation, is available as part of the open-source package FreeSurfer^{1}^{1}1http://freesurfer.net/. It segments MRI brain scans without any form of preprocessing such as skull stripping or bias field correction, taking around 10 minutes to process one subject on a state-of-the-art computer (measured on a machine with an Intel 12-core i7-8700K processor). As explained in (Puonti2016), in our implementation we make use of the fact that many neuroanatomical structures share the same intensity characteristics in MRI to reduce the number of free parameters in the model (e.g., all white matter structures share the same Gaussian mean and variance, as do most gray matter structures). Furthermore, for some structures (e.g., non-brain tissue) we use Gaussian mixture models instead of a single Gaussian. In addition to using full covariance matrices, our implementation also supports diagonal covariances, which is currently selected as the default behavior.

3 Modeling lesions
In order to make SAMSEG capable of additionally segmenting white matter lesions, we augment its generative model by introducing a binary lesion map $\mathbf{z} = (z_1, \ldots, z_I)$, where $z_i \in \{0, 1\}$ indicates the presence of a lesion in voxel $i$. The augmented model is depicted in Fig. 2, where the blue parts indicate the additional components compared to the original SAMSEG method. The complete model consists of a joint (i.e., over both $\mathbf{l}$ and $\mathbf{z}$ simultaneously) segmentation prior $p(\mathbf{l}, \mathbf{z} \mid \mathbf{h}, \boldsymbol{\theta}_l)$, where $\mathbf{h}$ is a new latent variable that helps constrain the shape of lesions, as well as a joint likelihood $p(\mathbf{D} \mid \mathbf{l}, \mathbf{z}, \boldsymbol{\theta}_d, \boldsymbol{\mu}_{\mathrm{les}}, \boldsymbol{\Sigma}_{\mathrm{les}})$, where $\boldsymbol{\mu}_{\mathrm{les}}$ and $\boldsymbol{\Sigma}_{\mathrm{les}}$ are new parameters that govern their appearance. In the following, we summarize the segmentation prior and the likelihood used in the augmented model, as well as the way the resulting model is used to obtain automated segmentations.
3.1 Segmentation prior
We use a joint segmentation prior of the form

$p(\mathbf{l}, \mathbf{z} \mid \mathbf{h}, \boldsymbol{\theta}_l) = p(\mathbf{z} \mid \mathbf{h}, \boldsymbol{\theta}_l) \, p(\mathbf{l} \mid \boldsymbol{\theta}_l),$

where $p(\mathbf{l} \mid \boldsymbol{\theta}_l)$ is the deformable atlas model defined in Section 2.1, and

$p(\mathbf{z} \mid \mathbf{h}, \boldsymbol{\theta}_l) = \prod_{i=1}^{I} p(z_i \mid \mathbf{h}, \boldsymbol{\theta}_l)$

is a factorized model where the probability that a voxel is part of a lesion is given by:

$p(z_i = 1 \mid \mathbf{h}, \boldsymbol{\theta}_l) = f_i(\mathbf{h}) \, g_i(\boldsymbol{\theta}_l).$

Here $f_i(\mathbf{h})$ aims to enforce shape constraints on lesions, whereas $g_i(\boldsymbol{\theta}_l)$ takes into account a voxel's spatial location within its neuroanatomical context. Below we provide more details on both these components of the model.
3.1.1 Modeling lesion shapes
In order to model lesion shapes, we use a variational autoencoder (Kingma2013; Rezende2014) according to which lesion segmentation maps are generated in a two-step process: an unobserved, low-dimensional code $\mathbf{h}$ is first sampled from a spherical Gaussian distribution $p(\mathbf{h}) = \mathcal{N}(\mathbf{h} \mid \mathbf{0}, \mathbf{I})$, and subsequently "decoded" into $\mathbf{z}$ by sampling from a factorized Bernoulli model:

$p(\mathbf{z} \mid \mathbf{h}, \mathbf{w}) = \prod_{i=1}^{I} f_i(\mathbf{h})^{z_i} \left( 1 - f_i(\mathbf{h}) \right)^{1 - z_i}.$

Here $f_i(\mathbf{h}) \in (0, 1)$ are the outputs of a "decoder" convolutional neural network (CNN) with filter weights $\mathbf{w}$, which parameterize the model.
Given a training data set in the form of $M$ binary segmentation maps $\{\mathbf{z}_m\}_{m=1}^{M}$, suitable network parameters can in principle be estimated by maximizing the log-probability assigned to the data by the model:

$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \sum_{m=1}^{M} \log p(\mathbf{z}_m \mid \mathbf{w}), \qquad p(\mathbf{z}_m \mid \mathbf{w}) = \int p(\mathbf{z}_m \mid \mathbf{h}, \mathbf{w}) \, p(\mathbf{h}) \, \mathrm{d}\mathbf{h}.$

However, because the integral over the latent codes makes this intractable, we use amortized variational inference in the form of stochastic gradient variational Bayes (Kingma2013; Rezende2014). In particular, we introduce an approximate posterior

$q(\mathbf{h} \mid \mathbf{z}, \boldsymbol{\phi}) = \mathcal{N}\!\left( \mathbf{h} \,\middle|\, \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{z}), \, \mathrm{diag}\!\left( \boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\mathbf{z}) \right) \right),$

where the functions $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\cdot)$ and $\boldsymbol{\sigma}^2_{\boldsymbol{\phi}}(\cdot)$ are implemented as an "encoder" CNN parameterized by $\boldsymbol{\phi}$. The variational parameters $\boldsymbol{\phi}$ are then learned jointly with the model parameters $\mathbf{w}$ by maximizing a variational lower bound $\sum_{m=1}^{M} \mathcal{L}_m(\mathbf{w}, \boldsymbol{\phi})$ using stochastic gradient descent, where

$\mathcal{L}_m(\mathbf{w}, \boldsymbol{\phi}) = -\mathrm{KL}\!\left( q(\mathbf{h} \mid \mathbf{z}_m, \boldsymbol{\phi}) \,\|\, p(\mathbf{h}) \right) + \mathbb{E}_{q(\mathbf{h} \mid \mathbf{z}_m, \boldsymbol{\phi})}\!\left[ \log p(\mathbf{z}_m \mid \mathbf{h}, \mathbf{w}) \right]. \quad (3)$

The first term is the Kullback-Leibler divergence between the approximate posterior and the prior, which can be evaluated analytically. The expectation in the last term is approximated using Monte Carlo sampling, using a change of variables (known as the "reparameterization trick") to reduce the variance in the computation of the gradient with respect to $\boldsymbol{\phi}$ (Kingma2013; Rezende2014).

Our training data set was derived from manual lesion segmentations in 212 MS subjects, obtained from the University Hospital of Basel, Switzerland. The segmentations were all affinely registered and resampled to a 1 mm isotropic grid of size 197 × 233 × 189. In order to reduce the risk of overfitting to the training data, we augmented each segmentation in the training data set by applying rotations of ±10 degrees around each axis, obtaining a total of 1484 segmentations. The architecture of our encoder and decoder networks is detailed in Fig. 3.
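For a single training segmentation, the two terms of the per-example lower bound in Eq. (3), the analytic Gaussian KL term and a one-sample reparameterized estimate of the Bernoulli reconstruction term, can be sketched as follows (using a tiny linear "decoder" and made-up numbers as stand-ins for the CNNs):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy stand-ins for the encoder outputs and the decoder network;
# the actual method uses convolutional networks for both.
mu    = np.array([0.5, -0.2])   # encoder mean for one training map z_m
sigma = np.array([0.8, 1.1])    # encoder standard deviations
W = rng.normal(size=(4, 2))     # "decoder": 4 voxels, 2 latent dimensions
z_m = np.array([1.0, 0.0, 1.0, 0.0])   # binary training segmentation

# Analytic KL( N(mu, diag(sigma^2)) || N(0, I) )
kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# One-sample Monte Carlo estimate of E_q[log p(z_m | h, w)]
# via the reparameterization trick: h = mu + sigma * eps, eps ~ N(0, I).
eps = rng.normal(size=2)
h = mu + sigma * eps
f = sigmoid(W @ h)                       # Bernoulli probabilities f_i(h)
recon = np.sum(z_m * np.log(f) + (1 - z_m) * np.log(1 - f))

elbo_m = -kl + recon                     # per-example lower bound, Eq. (3)
print(elbo_m < 0.0)                      # True: a log-likelihood bound of binary data
```

Gradients of this bound with respect to both the decoder and encoder parameters can then be followed with stochastic gradient descent, as done in the training procedure described above.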
We trained the model for 1000 epochs with a minibatch size of 10 using the Adam optimizer (Kingma) with a learning rate of $10^{-4}$. We approximated the expectation in the variational lower bound of Eq. (3) by using a single Monte Carlo sample in each step.

Fig. 3: Lesion shape model architecture consisting of two symmetrical convolutional neural networks: (a) decoder network and (b) encoder network. The decoder network generates lesion segmentations from a low-dimensional code. Its architecture has ReLU activation functions ($\max(0, x)$) and batch normalization (Ioffe2015) between each deconvolution layer, with the last layer having a sigmoid activation function, ensuring $f_i(\mathbf{h}) \in (0, 1)$. The encoder network encodes lesion segmentations into a latent code. The main differences compared to the decoder network are the use of convolutional layers instead of deconvolutional layers and, to encode the mean and variance parameters, the last layer has been split in two, with no activation function for the mean and a softplus activation function ($\log(1 + e^x)$) for the variance.

3.1.2 Modeling the spatial location of lesions
In order to encode the spatially varying frequency of occurrence of lesions across the brain, we model the probability of finding a lesion in voxel $i$, based on its location alone, as

$g_i(\boldsymbol{\theta}_l) = \sum_{j=1}^{J} \beta_j \, \psi_j(\boldsymbol{\theta}_l),$

where lesion probabilities $\beta_j$ defined in the vertices of the SAMSEG atlas mesh are interpolated at the voxel location. This effectively defines a lesion probability map that deforms in conjunction with the SAMSEG atlas to match the neuroanatomy in each image being segmented, allowing the model to impose contextual constraints on where lesions are expected to be found.
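In effect, a voxel's lesion prior $g_i(\boldsymbol{\theta}_l)$ is a barycentric interpolation of the lesion probabilities stored at the vertices of the mesh tetrahedron containing it. A toy sketch with made-up values:

```python
import numpy as np

# A voxel inside one tetrahedron of the atlas mesh: its lesion prior
# g_i is the interpolation of the lesion probabilities beta_j stored
# at the tetrahedron's four vertices, weighted by the (barycentric)
# interpolation basis functions evaluated at the voxel.
beta = np.array([0.02, 0.10, 0.40, 0.08])   # vertex lesion probabilities (made up)
psi  = np.array([0.1, 0.2, 0.6, 0.1])       # interpolation weights at voxel i (sum to 1)

g_i = float(psi @ beta)                     # interpolated lesion probability
print(round(g_i, 4))  # 0.27
```

As the mesh deforms to match a subject's anatomy, the weights follow the deformed vertex positions, so the lesion probability map deforms along with the atlas.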
We estimated the parameters $\beta_j$ by running SAMSEG on MRI scans (T1-weighted (T1w) and FLAIR) of 54 MS subjects in whom lesions had been manually annotated (data from the University Hospital of Basel, Switzerland), and recording the estimated atlas deformations. The parameters $\beta_j$ were then computed from the manual lesion segmentations by applying the same technique we used to estimate the label probabilities $\alpha_j^k$ in the SAMSEG atlas training phase (cf. Sec. 2.1).
3.2 Likelihood function
For the likelihood, which links joint segmentations $(\mathbf{l}, \mathbf{z})$ to intensities $\mathbf{D}$, we use the same model as SAMSEG in voxels that do not contain lesion ($z_i = 0$), but draw intensities in lesions ($z_i = 1$) from a separate Gaussian with parameters $(\boldsymbol{\mu}_{\mathrm{les}}, \boldsymbol{\Sigma}_{\mathrm{les}})$:

$p(\mathbf{D} \mid \mathbf{l}, \mathbf{z}, \boldsymbol{\theta}_d, \boldsymbol{\mu}_{\mathrm{les}}, \boldsymbol{\Sigma}_{\mathrm{les}}) = \prod_{i=1}^{I} p(\mathbf{d}_i \mid l_i, z_i, \cdot),$

where

$p(\mathbf{d}_i \mid l_i = k, z_i, \cdot) = \begin{cases} \mathcal{N}\!\left( \mathbf{d}_i \mid \boldsymbol{\mu}_{\mathrm{les}} + \mathbf{b}_i, \, \boldsymbol{\Sigma}_{\mathrm{les}} \right) & \text{if } z_i = 1, \\ \mathcal{N}\!\left( \mathbf{d}_i \mid \boldsymbol{\mu}_k + \mathbf{b}_i, \, \boldsymbol{\Sigma}_k \right) & \text{if } z_i = 0, \end{cases}$

with $\mathbf{b}_i$ the bias field model of Section 2.2. In order to constrain the values that the lesion intensity parameters can take, we make them conditional on the remaining intensity parameters using a normal-inverse-Wishart distribution:

$p(\boldsymbol{\mu}_{\mathrm{les}}, \boldsymbol{\Sigma}_{\mathrm{les}} \mid \boldsymbol{\theta}_d) \propto \mathcal{N}\!\left( \boldsymbol{\mu}_{\mathrm{les}} \,\middle|\, \boldsymbol{\mu}_{\mathrm{WM}}, \, \boldsymbol{\Sigma}_{\mathrm{les}} / K \right) \, \mathrm{IW}\!\left( \boldsymbol{\Sigma}_{\mathrm{les}} \,\middle|\, K \alpha \boldsymbol{\Sigma}_{\mathrm{WM}}, \, K \right). \quad (4)$

Here the subscript "WM" denotes the white matter Gaussian, and $K$ and $\alpha$ are hyperparameters in the model.
This choice of model is motivated by the fact that the normal-inverse-Wishart distribution is a conjugate prior for the parameters of a Gaussian distribution: Eq. (4) can be interpreted as providing $K$ "pseudo-voxels" with empirical mean $\boldsymbol{\mu}_{\mathrm{WM}}$ and variance $\alpha \boldsymbol{\Sigma}_{\mathrm{WM}}$ in scenarios where the lesion intensity parameters $\boldsymbol{\mu}_{\mathrm{les}}$ and $\boldsymbol{\Sigma}_{\mathrm{les}}$ need to be estimated from data. In the absence of any such pseudo-voxels ($K = 0$), Eq. (4) reduces to a flat prior, and lesions are modeled as a completely independent class. Although such models have been used in the literature (Guttmann1999; Kikinis1999; Shiee2010; Sudre2015), their robustness may suffer when applied to subjects with no or very few lesions, such as controls or patients with early disease, since there is essentially no data to estimate the lesion intensity parameters from. In the other extreme case, the number of pseudo-voxels can be set to such a high value ($K \to \infty$) that the intensity parameters of the lesions are fully determined by those of WM. This effectively replaces the Gaussian intensity model for WM in SAMSEG by a distribution with longer tails, in the form of a mixture of two Gaussians with identical means ($\boldsymbol{\mu}_{\mathrm{les}} = \boldsymbol{\mu}_{\mathrm{WM}}$) but variances that differ by a constant factor ($\alpha \boldsymbol{\Sigma}_{\mathrm{WM}}$ vs. $\boldsymbol{\Sigma}_{\mathrm{WM}}$). In this scenario, MS lesions are detected as model outliers in a method using robust model parameter estimation, another technique that has also frequently been used in the literature (VanLeemput2001; AitAli2005; Bricq2008; Rousseau2008; Prastawa2008; Liu2009; GarciaLorenzo2011).

Based on pilot experiments on a variety of datasets (distinct from the ones used in the results section), we found that good results are obtained by using an intermediate value of $K$ pseudo-voxels for 1 mm isotropic scans, together with a scaling factor $\alpha$. In order to adapt to different image resolutions, $K$ is scaled inversely proportionally with the voxel size in our implementation. We will visually demonstrate the role of these hyperparameters in constraining the lesion intensity parameters in Sec. 3.4.
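The pseudo-voxel interpretation can be illustrated with a one-dimensional sketch in which the $K$ pseudo-voxels are simply pooled with the observed lesion intensities (the exact normal-inverse-Wishart MAP expressions differ in small additive constants in the degrees of freedom; all values below are made up):

```python
import numpy as np

def lesion_params_map(lesion_vals, mu_wm, var_wm, K, alpha):
    """1-D sketch of the pseudo-voxel reading of Eq. (4): K pseudo-voxels
    with mean mu_wm and variance alpha*var_wm are pooled with the N
    observed lesion intensities when estimating the lesion Gaussian."""
    N = len(lesion_vals)
    xbar = np.mean(lesion_vals) if N > 0 else 0.0
    mu = (N * xbar + K * mu_wm) / (N + K)          # pooled mean
    ss_data = np.sum((lesion_vals - mu)**2)
    ss_pseudo = K * (alpha * var_wm + (mu_wm - mu)**2)
    var = (ss_data + ss_pseudo) / (N + K)          # pooled variance
    return mu, var

mu_wm, var_wm = 70.0, 25.0
# Few lesion voxels (low lesion load): the estimate stays near WM.
mu_lo, var_lo = lesion_params_map(np.array([95.0, 98.0]), mu_wm, var_wm, K=50, alpha=10.0)
# Many lesion voxels (high lesion load): the data dominates the prior.
mu_hi, var_hi = lesion_params_map(np.full(5000, 96.0), mu_wm, var_wm, K=50, alpha=10.0)
print(abs(mu_lo - mu_wm) < abs(mu_hi - mu_wm))  # True: low load keeps the mean near WM
```

This mirrors the behavior illustrated in Sec. 3.4: with little lesion evidence the lesion Gaussian is pulled toward a wide-variance version of the WM Gaussian, while with a high lesion load it is essentially fit to the data.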
3.3 Segmentation
As in the original SAMSEG method, segmentation proceeds by first obtaining point estimates $\hat{\boldsymbol{\theta}}$ that fit the model to the data, and then inferring the corresponding segmentation posterior $p(\mathbf{l}, \mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$, which is now jointly over $\mathbf{l}$ and $\mathbf{z}$ simultaneously. Unlike in SAMSEG, however, both steps are made intractable by the presence of the new variables $\mathbf{z}$ and $\mathbf{h}$ in the model. In order to sidestep this difficulty, we obtain $\hat{\boldsymbol{\theta}}$ through a joint optimization over both $\boldsymbol{\theta}$ and $\mathbf{z}$:

$\{ \hat{\boldsymbol{\theta}}, \hat{\mathbf{z}} \} = \arg\max_{\boldsymbol{\theta}, \mathbf{z}} \, p(\boldsymbol{\theta}, \mathbf{z} \mid \mathbf{D})$

in a simplified model in which the constraints on lesion shape have been removed, by clamping all decoder network outputs $f_i(\mathbf{h})$ to a constant value. This simplification is defensible since the aim here is merely to find appropriate model parameters, rather than highly accurate lesion segmentations. By doing so, the latent code $\mathbf{h}$ is effectively removed from the model and the optimization simplifies into the one used in the original SAMSEG method, with only minor modifications due to the prior on the lesion intensity parameters. Details are provided in Appendix B.
Once parameter estimates $\hat{\boldsymbol{\theta}}$ are available, we compute segmentations using the factorization

$p(\mathbf{l}, \mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}}) = p(\mathbf{l} \mid \mathbf{z}, \mathbf{D}, \hat{\boldsymbol{\theta}}) \, p(\mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}}),$

first estimating $\hat{\mathbf{z}}$ from $p(\mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$, and then plugging this into $p(\mathbf{l} \mid \hat{\mathbf{z}}, \mathbf{D}, \hat{\boldsymbol{\theta}})$ to estimate $\hat{\mathbf{l}}$.

Evaluating $p(\mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$ involves marginalizing over both $\mathbf{l}$ and $\mathbf{h}$, which we approximate by drawing $S$ Monte Carlo samples:

$p(z_i = 1 \mid \mathbf{D}, \hat{\boldsymbol{\theta}}) \simeq \frac{1}{S} \sum_{s=1}^{S} p(z_i = 1 \mid \mathbf{h}^{(s)}, \mathbf{D}, \hat{\boldsymbol{\theta}}).$

This allows us to estimate the probability of lesion occurrence in each voxel, which we then compare with a user-specified threshold value $\gamma$ to obtain the final lesion segmentation $\hat{\mathbf{z}}$ (i.e., $\hat{z}_i = 1$ if and only if this probability exceeds $\gamma$). Details on how we approximate $p(\mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$ using Monte Carlo sampling are provided in Appendix C.

Voxels that are not assigned to lesion ($\hat{z}_i = 0$) in the previous step are finally assigned to the neuroanatomical structure with the highest probability, which simply involves computing $\arg\max_k \, p(l_i = k \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$ with $p(l_i = k \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$ defined in Eq. (2).
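The final decision rule, averaging per-sample lesion probabilities, thresholding at $\gamma$, and assigning the remaining voxels to their most probable structure, can be sketched as follows (the per-sample probabilities and structure posteriors are simulated placeholders; in the method they come from the decoder network combined with the intensity and atlas terms):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of the voxel-wise lesion posterior: average the
# per-sample probabilities p(z_i = 1 | h^(s), D, theta_hat) over S samples.
S, I = 100, 6
scale = np.array([0.1, 0.2, 0.9, 1.0, 0.3, 0.05])       # made-up per-voxel scaling
per_sample_p = rng.uniform(0.0, 1.0, size=(S, I)) * scale
lesion_posterior = per_sample_p.mean(axis=0)

gamma = 0.3                                  # user-specified threshold
z_hat = lesion_posterior > gamma             # final binary lesion map

# Remaining voxels get the most probable neuroanatomical structure (Eq. 2);
# the structure posteriors here are random placeholders.
struct_posterior = rng.dirichlet(np.ones(4), size=I)
labels = np.argmax(struct_posterior, axis=1)
labels[z_hat] = -1                           # mark lesion voxels separately
print(int(z_hat.sum()))
```

The two-stage rule guarantees that every voxel receives exactly one label: lesion where the averaged posterior exceeds $\gamma$, and the MAP neuroanatomical structure elsewhere.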
In agreement with other work (VanLeemput2001; AitAli2005; Prastawa2008; Shiee2010; GarciaLorenzo2011; Jain2015), we have found that using known prior information regarding the expected intensity profile of MS lesions in various MRI contrasts can help reduce the number of false positive detections. Therefore, we prevent some voxels from being assigned to lesion (i.e., forcing $z_i = 0$) based on their intensities in relation to the estimated intensity parameters: in our current implementation, only voxels with an intensity higher than the mean of the gray matter Gaussian in FLAIR and/or T2 (if these modalities are present) are considered candidate lesions.
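This intensity-based restriction of candidate lesion voxels amounts to a simple mask, sketched here with made-up intensities:

```python
import numpy as np

# Hedged sketch of the intensity restriction: only voxels brighter than
# the estimated gray matter mean in FLAIR remain lesion candidates;
# all others are forced to z_i = 0. All values are illustrative.
flair = np.array([60.0, 72.0, 95.0, 110.0, 80.0])   # FLAIR intensities
mu_gm_flair = 75.0                                  # estimated GM mean in FLAIR

candidate = flair > mu_gm_flair                     # intensity-based candidate mask
lesion_posterior = np.array([0.6, 0.7, 0.8, 0.9, 0.2])
lesion_posterior[~candidate] = 0.0                  # force z_i = 0 elsewhere
print(candidate.tolist())  # [False, False, True, True, True]
```

When both FLAIR and T2 are available, the same mask would be built from whichever of the two contrasts exceed their respective gray matter means.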
Since estimating $p(\mathbf{z} \mid \mathbf{D}, \hat{\boldsymbol{\theta}})$ involves repeatedly invoking the decoder and encoder networks of the lesion shape model, as detailed in Appendix C, we implemented the proposed method as an add-on to SAMSEG in Python using the TensorFlow library (Abadi2016). Estimating $\hat{\boldsymbol{\theta}}$ has the same computational complexity as running SAMSEG (i.e., taking approximately 10 minutes on a state-of-the-art machine), while the Monte Carlo sampling takes an additional 5 minutes on a GeForce GTX 1060 graphics card, bringing the total computation time to around 15 minutes per subject.

3.4 Illustration of the method
In order to illustrate the effect of the various components of the method, here we analyze its behaviour when segmenting T1w-FLAIR scans of two MS subjects – one with a low and one with a high lesion load. Fig. 4 shows, in addition to the input data and the final lesion probability estimate, also an intermediate lesion probability obtained with the simplified model used to estimate $\hat{\boldsymbol{\theta}}$, i.e., before the FLAIR-based intensity constraints and the lesion shape constraints are applied. From these images we can see that the lesion shape model and the intensity constraints help remove false positive detections and enforce more realistic shapes of lesions, especially for the case with low lesion load.
Fig. 5 analyzes the effect of the prior on the lesion intensity parameters for the two subjects shown in Fig. 4. When the lesion load is high, the prior does not have a strong influence, leaving the lesion Gaussian "free" to fit the data. However, when the lesion load is low, the lesion Gaussian is constrained to retain a wide variance and a mean close to the mean of WM, effectively turning the model into an outlier detection method for WM lesions. This behavior is important in cases when few lesions are present in the images, ensuring the method works robustly even when only limited data is available to estimate the lesion Gaussian parameters.
In order to demonstrate that the model also works robustly in control subjects (with no lesions at all), and can therefore be safely applied in studies comparing MS subjects with controls, we further segmented T1w-FLAIR scans of 50 MS patients and 23 healthy control subjects (data taken from the Achieva dataset described in Sec. 4.1), and computed the total volume of the lesions in each subject. The results are shown in Fig. 6; the volumes were 8.95 ± 9.18 ml for MS subjects vs. 0.98 ± 0.77 ml for controls. Although the average lesion volume for controls was not exactly zero, a visual inspection revealed that this was due to some controls having WM hyperintensities that were segmented by the method as MS lesions, which we find acceptable.
4 Experiments
In this section, we describe three datasets that we will use for the experiments in this paper. We also outline two stateoftheart methods for MS lesion segmentation that the proposed method is benchmarked against, as well as the metrics and measures used in our experiments.
4.1 Datasets
In order to test the proposed method and demonstrate its contrast-adaptiveness, we conducted experiments on three datasets acquired with different scanner platforms, field strengths, acquisition protocols and image resolutions:
 

MSSeg: This dataset is the publicly available training set of the MS lesion segmentation challenge that was held in conjunction with the MICCAI 2016 conference (Commowick2018). It consists of 15 MS cases from three different scanners, all acquired using a harmonized imaging protocol (Cotton2015). For each patient, a 3D T1-weighted (T1w) sequence, both native and contrast-enhanced (T1c), an axial dual PD-T2-weighted (T2w) sequence, and a 3D fluid-attenuated inversion recovery (FLAIR) sequence were acquired. Each subject's lesions were delineated by seven different raters on the FLAIR scan and, if necessary, corrected using the T2w scan. These delineated images were then fused to create a consensus lesion segmentation for each subject. Both raw images and preprocessed images (preprocessing steps: denoising, rigid registration, brain extraction and bias field correction – see Commowick2018 for details) were made available by the challenge organizers. In our experiments we used the preprocessed data, which required only minor modifications in our software to remove non-brain tissues from the model. We note that the original challenge also included a separate set of 38 test subjects, but at the time of writing this data is no longer available.
 

Trio: This dataset consists of 40 MS cases acquired on a Siemens Trio 3T scanner at the Danish Research Centre for Magnetic Resonance (DRCMR). For each patient, a 3D T1w sequence, a T2w sequence and a FLAIR sequence were acquired. Ground truth lesion segmentations were automatically delineated on the FLAIR images using the Jim software^{2}^{2}2http://www.xinapse.com/, and then checked and, if necessary, corrected by an expert rater at DRCMR using the T2w and MPRAGE images.
 

Achieva: This dataset consists of 50 MS cases and 25 healthy controls acquired on a Philips Achieva 3T scanner at DRCMR. After a visual inspection of the images, we decided to remove 2 healthy controls from the dataset as they presented marked gray matter atrophy and white matter hyperintensities. For each patient, a 3D T1w sequence, a T2w sequence and a FLAIR sequence were acquired. Ground truth lesion segmentations were delineated using the same protocol as the one used for the Trio dataset.
A summary of the datasets, with scanner type, image modalities and voxel resolution details, can be found in Table 1. For each subject all the contrasts were coregistered and resampled to the FLAIR scan for MSSeg, and to the T1w scan for Trio and Achieva. This is the only preprocessing step required by the proposed method.
Dataset | Scanner            | Modality | Voxel resolution [mm]      | Subjects
--------|--------------------|----------|----------------------------|---------
MSSeg   | Philips Ingenia 3T | 3D FLAIR | 0.74 × 0.74 × 0.7          | 5
        |                    | 3D T1w   | 0.74 × 0.74 × 0.85         |
        |                    | 3D T1c   | 0.74 × 0.74 × 0.85         |
        |                    | 2D T2w   | 0.45 × 0.45 × 3            |
        |                    | 2D PD    | 0.45 × 0.45 × 3            |
        | Siemens Aera 1.5T  | 3D FLAIR | 1.03 × 1.03 × 1.25         | 5
        |                    | 3D T1w   | 1.08 × 1.08 × 0.9          |
        |                    | 3D T1c   | 1.08 × 1.08 × 0.9          |
        |                    | 2D T2w   | 0.72 × 0.72 × 4 (gap: 1.2) |
        |                    | 2D PD    | 0.72 × 0.72 × 4 (gap: 1.2) |
        | Siemens Verio 3T   | 3D FLAIR | 0.5 × 0.5 × 1.1            | 5
        |                    | 3D T1w   | 1 × 1 × 1                  |
        |                    | 3D T1c   | 1 × 1 × 1                  |
        |                    | 2D T2w   | 0.69 × 0.69 × 3            |
        |                    | 2D PD    | 0.69 × 0.69 × 3            |
Trio    | Siemens Trio 3T    | 2D FLAIR | 0.7 × 0.7 × 4              | 40
        |                    | 3D T1w   | 1 × 1 × 1                  |
        |                    | 2D T2w   | 0.7 × 0.7 × 4              |
Achieva | Philips Achieva 3T | 3D FLAIR | 1 × 1 × 1                  | 73
        |                    | 3D T1w   | 0.85 × 0.85 × 0.85         |
        |                    | 3D T2w   | 0.85 × 0.85 × 0.85         |
4.2 Benchmark methods for lesion segmentation
In order to evaluate the lesion segmentation component of the proposed method, we benchmarked it against two publicly available and widely used algorithms for MS lesion segmentation:
 

LST-lga^{3}^{3}3https://www.appliedstatistics.de/lst.html (Schmidt2012): This lesion growth algorithm starts by segmenting a T1w image into three main tissue classes (CSF, GM and WM) using SPM12^{4}^{4}4https://www.fil.ion.ucl.ac.uk/spm/software/spm12/, and combines the resulting segmentation with co-registered FLAIR intensities to calculate a lesion belief map. A pre-chosen initial threshold $\kappa$ is then used to create an initial binary lesion map, which is subsequently grown along voxels that appear hyperintense in the FLAIR image. We set $\kappa$ to its recommended default value of 0.3, which was also used in previous studies (Muhlau2013; Rissanen2014).
 

NicMsLesions^{5}^{5}5https://github.com/sergivalverde/nicMsLesions (Valverde2017; Valverde2019)
: This deep learning method is based on a cascade of two 3D convolutional neural networks, where the first one reveals possible candidate lesion voxels, and the second one reduces the number of false positive outcomes. Both networks were trained on T1w and FLAIR scans coming from a publicly available training dataset of the MS lesion segmentation challenge held in conjunction with the MICCAI 2008 conference
(MICCAI2008) (20 cases) and the MSSeg dataset (15 cases). This method was one of the top performers on the test dataset of the MICCAI 2016 challenge (Commowick2018), and one of the few methods for which an implementation is publicly available.
We note that both these benchmark methods specifically target T1w-FLAIR input, whereas the proposed method is not tuned to any particular combination of input modalities.
4.3 Metrics and measures
In order to evaluate the influence of varying the input modalities on the segmentation performance of the proposed method, and to assess segmentation accuracy with respect to that of other methods and human raters, we used a combination of segmentation volume estimates, Pearson correlation coefficients between such estimates and reference values, and Dice scores. Volumes were computed by counting the number of voxels assigned to a specific structure and converting into mm³, whereas Dice coefficients were computed as

$D(\mathcal{A}, \mathcal{B}) = \frac{2 \, |\mathcal{A} \cap \mathcal{B}|}{|\mathcal{A}| + |\mathcal{B}|},$

where $\mathcal{A}$ and $\mathcal{B}$ denote segmentation masks, and $|\cdot|$ counts the number of voxels in a mask.
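The Dice overlap above translates directly into a few lines of Python, shown here for two small made-up binary masks:

```python
import numpy as np

def dice(a, b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    # Two empty masks overlap perfectly by convention.
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

auto   = np.array([[0, 1, 1], [0, 1, 0]])   # automated segmentation (made up)
manual = np.array([[0, 1, 0], [1, 1, 0]])   # manual reference (made up)
print(dice(auto, manual))  # 2*2 / (3+3) ≈ 0.6667
```

A Dice score of 1 indicates perfect overlap, while 0 indicates that the two masks are disjoint.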
The proposed method and both benchmark algorithms produce a probabilistic lesion map that needs to be thresholded to obtain a final lesion segmentation, which requires an appropriate threshold value to be set (the variable $\gamma$ in the proposed method). In order to ensure an objective comparison between the methods, we used a leave-one-out cross-validation strategy in which the threshold for each test image was set to the value that maximizes the average Dice overlap with manual segmentations in all the other images of the same dataset.
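The leave-one-out threshold selection can be sketched as follows, assuming a hypothetical precomputed array of Dice scores per subject and candidate threshold:

```python
import numpy as np

def best_threshold(dice_curves, thresholds, test_idx):
    """Pick, for one test image, the threshold maximizing the mean Dice
    over all *other* images of the same dataset (leave-one-out)."""
    others = np.delete(dice_curves, test_idx, axis=0)
    return thresholds[np.argmax(others.mean(axis=0))]

# dice_curves[s, t]: Dice of subject s when thresholded at thresholds[t]
# (made-up numbers purely for illustration).
thresholds = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
dice_curves = np.array([
    [0.50, 0.62, 0.60, 0.55, 0.40],
    [0.48, 0.60, 0.65, 0.58, 0.42],
    [0.52, 0.64, 0.61, 0.50, 0.38],
])
print(best_threshold(dice_curves, thresholds, test_idx=0))  # 0.5
```

Because the test image never contributes to its own threshold, the resulting Dice scores are not optimistically biased.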
5 Results
In this section, we first evaluate how the proposed model adapts to different input modalities and acquisition platforms. We then compare the lesion segmentation performance of our model against that of the two benchmark methods, and relate it to human inter-rater variability. Finally, we perform an indirect validation of the whole-brain segmentation component of the method.
Throughout the section we use boxplots to show some of the results. In these plots, the median is indicated by a horizontal line, plotted inside boxes that extend from the first to the third quartile values of the data. The range of the data is indicated by whiskers extending from the boxes, with outliers represented by circles.
5.1 Scanner and contrast adaptive segmentations
In order to demonstrate the ability of our method to adapt to different types and combinations of MRI sequences acquired with different scanners, Fig. 7 shows the method’s segmentation results along with the manual segmentations for a representative subset of combinations for one subject in the MSSeg (consensus as manual segmentation), the Trio and the Achieva datasets. Showing all possible combinations is not feasible; mixing the 5 contrasts in the MSSeg dataset alone already yields 31 possible multi-contrast combinations. Nonetheless, it is clear that the model is indeed able to adapt to the specific contrast properties of its input scans. A visual inspection of its whole-brain segmentation component indicates that the method benefits from having access to the T1w contrast for best performance. This is especially clear when only the FLAIR contrast is provided, which visually degrades the segmentation of the white-gray boundaries in cortical regions due to the low contrast between white and gray matter in FLAIR.
When the lesion probability maps produced by the method are compared visually with the corresponding manual lesion segmentations, the method appears to benefit from having access to the FLAIR contrast for best lesion segmentation performance. This is confirmed by the quantitative analysis shown in Fig. 8, which plots the Dice overlap scores for each of the seven input combinations that all three datasets have in common, namely T1w, T2w, FLAIR, T1w-T2w, T1w-FLAIR, T2w-FLAIR, and T1w-T2w-FLAIR. Although the inclusion of additional contrasts does not hurt lesion segmentation performance, across all three datasets the best results are obtained whenever the FLAIR contrast is included as input to the model. This finding is perhaps not surprising, given that the manual delineations were all primarily based on the FLAIR image.
Considering both the whole-brain and lesion segmentation performance together, we conclude that the combination T1w-FLAIR is well-suited for obtaining good results with the proposed method, although it will also accept other and/or additional contrasts beyond T1w and FLAIR.
5.2 Lesion segmentation
In order to compare the lesion segmentation performance of our model against that of the two benchmark methods, and relate it to human inter-rater variability, we here present a number of results based on the T1w-FLAIR input combination (which is the combination required by the benchmark methods).
Comparison with state-of-the-art lesion segmentation methods
Fig. 9 shows automatic segmentations of two randomly selected subjects from each dataset, both for our method and for the two benchmark methods LST-lga and NicMSLesions, along with the corresponding manual segmentations (for MSSeg we used the consensus manual segmentations). Visually, all three methods perform similarly on the Achieva MS data, but some of the results for NicMSLesions appear inferior to those obtained with the other two methods on MSSeg and Trio data. This qualitative observation is confirmed by the quantitative analysis shown in Fig. 10, where the three methods’ Dice overlap scores are compared on each dataset: similar performances are obtained for all methods on Achieva data, but NicMSLesions trails the other two methods on MSSeg and Trio data. Especially for MSSeg data this is a surprising result, since NicMSLesions was trained on this specific dataset, i.e., the subjects used for testing were part of this method’s training data, potentially biasing the results in its favor. Based on Dice scores, the proposed method outperforms LST-lga on MSSeg data, although there are no statistically significant differences between the two methods on the other datasets.
Fig. 11 summarizes a final quantitative analysis. For each of the three methods, the automatically computed lesion volumes are plotted against the manually determined ones. In these plots, which pool results across all three datasets, Pearson correlation coefficients and linear regression lines are also indicated. All three methods replicate manual volumetry reasonably well, although NicMSLesions suffers from a number of outliers that decrease the correlation.
Taken together, we conclude that the lesion segmentation performance of the proposed method is on par with the current state of the art in the field.
Lesion segmentation performance in terms of Dice overlap with manual raters for the proposed method and two benchmark methods (LST-lga and NicMSLesions) on T1w-FLAIR input. Statistically significant differences between two methods, computed with a two-tailed paired t-test, are indicated by asterisks (“***” for p-value
Inter-rater variability
We evaluated the proposed method’s lesion segmentation performance in the context of human inter-rater variability, taking advantage of the availability of lesion segmentations by seven different raters in the MSSeg dataset. Table 2 shows the Pearson correlation coefficient of lesion volume estimates between each pair of raters, as well as between each rater and the proposed method on this dataset. The average correlation coefficient between one rater and the six others is in the range [0.91, 0.97] (bottom row of the table), whereas between our method and all human raters it is 0.95 on average, which falls well within the range of human inter-rater variability. When the degree of spatial overlap is taken into account, however, the human raters are in stronger agreement among themselves than with the proposed method: Table 3 shows the lesion segmentation performance in terms of average Dice overlap between each pair of the seven raters, and between each rater and the proposed method. On average, our method achieves a Dice overlap score of 0.57, which is slightly below the mean human raters’ range of [0.59, 0.69].
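The pairwise volume-agreement analysis above, including the per-rater averages reported in the table's bottom row, can be sketched as follows (the function name and data layout are our own illustrative choices):

```python
import numpy as np

def rater_agreement(volumes: np.ndarray):
    """Pairwise Pearson correlation of per-subject lesion volumes.

    volumes[r, s] is the lesion volume estimated by rater r for subject s.
    Returns the rater-by-rater correlation matrix and, for each rater, the
    mean correlation with all other raters (the table's bottom row).
    """
    corr = np.corrcoef(volumes)  # raters x raters correlation matrix
    n = corr.shape[0]
    # Drop the diagonal (self-correlations) and average the rest per rater.
    off_diag = corr[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    return corr, off_diag.mean(axis=1)
```

The same routine applies unchanged when one "rater" row holds the automatic method's volume estimates, yielding its average correlation with the human raters.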
5.3 Whole-brain segmentation
Since no ground truth segmentations are available for a direct evaluation of the whole-brain segmentation component of our method, we performed an indirect validation, evaluating its potential for replacing lesion filling approaches that rely on manually annotated lesions, as well as its ability to replicate known atrophy patterns in MS. The results concentrate on the following 25 main neuroanatomical regions, segmented from T1w-FLAIR scans: left and right cerebral white matter, cerebellum white matter, cerebral cortex, cerebellum cortex, lateral ventricle, hippocampus, thalamus, putamen, pallidum, caudate, amygdala, nucleus accumbens and brain stem. To avoid clutter, the quantitative results for left and right structures are averaged. Note that lesion segmentations are not merged into any of these brain structures (i.e., they leave “holes” in the white matter), so that the results reflect performance only for the normal-appearing parts of structures.
Comparison with lesion filling
It is well-known that white matter lesions can severely interfere with the quantification of normal-appearing structures when standard brain MRI segmentation techniques are used (chard2010reducing; battaglini2012evaluating; gelineau2012effect; ceccarelli2012impact; nakamura2009segmentation; vrenken2013recommendations). A common strategy is therefore to use a lesion-filling procedure (sdika2009nonrigid; chard2010reducing), in which lesions are first manually segmented, their original voxel intensities are replaced with normal-appearing white matter intensities, and standard tools are then used to segment the resulting, preprocessed images. Using such a procedure with SAMSEG would yield whole-brain segmentations that can serve as “silver standard” benchmarks against which the results of the proposed method (which works directly on the original scans) can be compared. In practice, however, we found that replacing lesion intensities, which is typically done in T1w only, did not work well in FLAIR in our experiments. Therefore, rather than explicitly replacing intensities, we obtained silver standard segmentations by simply masking out lesions during the SAMSEG processing, effectively ignoring lesion voxels during the model fitting.
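The masking strategy can be illustrated in a few lines: instead of filling lesion intensities with synthetic values, lesion voxels are simply excluded when estimating a class's intensity statistics. This is a deliberately simplified single-Gaussian sketch under our own naming, not SAMSEG's actual implementation:

```python
import numpy as np

def fit_class_ignoring_lesions(intensities: np.ndarray,
                               lesion_mask: np.ndarray):
    """Estimate a tissue class's intensity mean and variance, masking lesions.

    Rather than replacing lesion intensities ("lesion filling"), the lesion
    voxels are dropped from the fit entirely, so they cannot bias the
    statistics of the normal-appearing tissue.
    """
    normal_appearing = intensities[~lesion_mask]
    return float(normal_appearing.mean()), float(normal_appearing.var())
```

Because the lesion voxels never enter the likelihood, this works for any input contrast, including FLAIR, where synthesizing plausible replacement intensities proved problematic.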
We wished to interpret discrepancies between the method’s segmentations and the silver standard within the context of the human inter-rater variability associated with manually segmenting lesions. Therefore, we performed experiments on the MSSeg dataset, repeatedly recomputing the silver standard using each of the seven raters’ manual lesion annotations in turn. The results are shown in Tables 4 and 5 for Pearson correlation coefficients between estimated volumes and Dice segmentation overlap scores, respectively. Each line in these tables corresponds to one structure, showing the average consistency between the silver standard of each rater and those of the six other raters, as well as the average consistency between the proposed method’s segmentation and the silver standards of all raters. The results indicate that, in terms of Pearson correlation coefficient, the performance of our method falls within the range of inter-rater variability, albeit narrowly (average value 0.988 vs. inter-rater range [0.988, 0.992]). In terms of Dice scores, however, the method slightly underperforms compared to the inter-rater variability (average value 0.971 vs. inter-rater range [0.978, 0.980]).
Detecting atrophy patterns in MS
In a final analysis, we assessed whether previously reported volume reductions in specific brain structures in MS can be automatically detected with the proposed method. To this end, we segmented the 23 controls and the 50 MS subjects of the Achieva dataset, and compared the volumes of various structures between the two groups. Volumes were normalized for age, gender and total intracranial volume by regressing these covariates out with a general linear model. The intracranial volume used for the normalization was computed by summing the volumes of all the structures, as segmented by the method, within the intracranial vault. The results are shown in Fig. 12. Although not all volumes showed significant differences between groups, well-established differences were replicated. In particular, we demonstrated decreased volumes of cerebral white matter, cerebral cortex, thalamus and caudate (Chard2002; Houtchens2007; Azevedo2018), as well as an increased volume of the lateral ventricles (Zivadinov2016).
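The covariate normalization step can be sketched with ordinary least squares (the function name and the add-back-the-mean convention are our own illustrative choices):

```python
import numpy as np

def residualize_volumes(volumes: np.ndarray,
                        covariates: np.ndarray) -> np.ndarray:
    """Normalize structure volumes for nuisance covariates with a GLM.

    volumes: (n_subjects,) raw volumes of one structure.
    covariates: (n_subjects, k) design columns, e.g. age, gender, and total
    intracranial volume. The covariates (plus an intercept) are regressed
    out by ordinary least squares; the residuals are returned with the grand
    mean added back so the values stay in interpretable volume units.
    """
    X = np.column_stack([np.ones(len(volumes)), covariates])
    beta, *_ = np.linalg.lstsq(X, volumes, rcond=None)
    residuals = volumes - X @ beta
    return residuals + volumes.mean()
```

Group differences between MS subjects and controls can then be tested on the residualized volumes without confounding by age, gender, or head size.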
6 Discussion and conclusion
In this paper, we have proposed a method for the simultaneous segmentation of white matter lesions and normal-appearing neuroanatomical structures from multi-contrast brain MRI scans of MS patients. The method integrates a novel model for white matter lesions into a previously validated generative model for whole-brain segmentation. By using separate models for the shape of anatomical structures and their appearance in MRI, the algorithm is able to adapt to data acquired with different scanners and imaging protocols without needing to be retrained. We validated the method using three disparate datasets, showing state-of-the-art performance in white matter lesion segmentation while simultaneously segmenting dozens of other brain structures. We further demonstrated that it can also be applied robustly to MRI scans of healthy controls, and that it replicates previously documented atrophy patterns in deep gray matter structures in MS. The proposed algorithm is publicly available as part of the open-source neuroimaging package FreeSurfer.
By performing both whole-brain and white matter lesion segmentation at the same time, the method we propose aims to supplant the two-stage “lesion filling” procedure that is commonly used in morphometric studies in MS, in which lesions segmented in a first step are used to avoid biasing a subsequent analysis of normal-appearing structures with software tools developed for healthy brain scans. In order to evaluate whether our method is successful in this regard, we compared its whole-brain segmentation performance against the results obtained when lesions are segmented a priori by seven different human raters instead of automatically by the method itself. Our results show that the volumes of various neuroanatomical structures obtained when lesions are segmented automatically fall within the range of inter-rater variability, indicating that the proposed method may be used instead of lesion filling with manual lesion segmentations in large volumetric studies of brain atrophy in MS. When detailed spatial overlap is analyzed, however, we found that the automatic segmentation does not fully reach the performance obtained with human lesion annotation as measured by Dice overlap. Similar results were obtained when we analyzed the segmentation performance of the lesions themselves: while automatically obtained lesions fell within the range of human inter-rater variability in terms of volumetry, in terms of spatial overlap the agreement between human experts was still higher than what can be achieved with the proposed method. We note, however, that the latter result is in line with recent research analyzing the performance of the best lesion segmentation methods currently available (Commowick2018).
For biological validation, we also compared brain volumes segmented by the method between patients with MS and healthy controls, and demonstrated well-established alterations of brain volumes in MS. These group comparisons thus gave no indication that the integration of the white matter lesion segmentation component into the model disturbs the segmentation of anatomical structures.
Like many other methods for MS lesion segmentation, the method proposed here produces a spatial map indicating each voxel’s probability of belonging to a lesion, which can then be thresholded to obtain a final lesion segmentation. Although in our experience good results can be obtained by using the same threshold value across datasets (e.g., ), changing this value allows one to adjust the trade-off between false positive and false negative lesion detections. Since some MRI sequences and scanners depict lesions with a higher contrast than others, and because there is often considerable disagreement between human experts regarding the exact extent of lesions (Zijdenbos_MICCAI), our implementation exposes this threshold value as an optional, tunable parameter to the end-user. Suitable threshold values can be found by visually inspecting the lesion segmentations of a few cases or, in large-scale studies, using cross-validation as we did in our experiments.
By providing the ability to robustly and efficiently segment multi-contrast scans of MS patients across a wide range of imaging equipment and protocols, the software tool presented here may help facilitate large cohort studies aiming to elucidate the morphological and temporal dynamics underlying disease progression and accumulation of disability in MS. Furthermore, in current clinical practice, high-resolution multi-contrast images, which can be used to increase the accuracy of lesion segmentation, represent a significantly increased reading burden for the neuroradiologist, and are hence frequently not acquired. The emergence of robust, multi-contrast segmentation tools such as ours may help break the link between the resolution and number of contrasts of the acquired data and the human time needed to evaluate it, thus potentially increasing the accuracy of the resulting measures.
The ability of the proposed method to automatically tailor its appearance models for specific datasets makes it very flexible, allowing it to seamlessly take advantage of novel, potentially more sensitive and specific MRI acquisitions as they are developed. Although not extensively tested, the proposed method should make it possible to, with minimal adjustments, segment data acquired with advanced research sequences such as MP2RAGE (MP2RAGEOrig), DIR (DIROrig), FLAIR (FLAIR2) or T2* (T2StartOrig), both at conventional and at ultra-high magnetic field strengths. We are currently pursuing several extensions of the proposed method, including the ability to subsequently create cortical surfaces and parcellations in FreeSurfer, as well as a dedicated version for longitudinal data.
7 Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 765148, as well as from the National Institute of Neurological Disorders and Stroke under project number R01NS112161. Hartwig R. Siebner holds a 5-year professorship in precision medicine at the Faculty of Health Sciences and Medicine, University of Copenhagen, which is sponsored by the Lundbeck Foundation (Grant Nr. R186-2015-2138). Mark Mühlau was supported by the German Research Foundation (Priority Program SPP2177, Radiomics: Next Generation of Biomedical Imaging), project number 428223038.
8 Conflicts of interest
Hartwig R. Siebner has received honoraria as speaker from Sanofi Genzyme, Denmark and Novartis, Denmark, as consultant from Sanofi Genzyme, Denmark and as senior editor (NeuroImage) from Elsevier Publishers, Amsterdam, The Netherlands. He has received royalties as book editor from Springer Publishers, Stuttgart, Germany.
Appendix A Parameter optimization in SAMSEG
We here describe how we perform the optimization of the objective function with respect to the model parameters in the original SAMSEG model. We follow a coordinate ascent approach, in which a limited-memory BFGS optimization of the atlas mesh deformation is interleaved with a generalized EM (GEM) optimization of the remaining parameters. The GEM algorithm was derived in (VanLeemput1999) based on (Wells1996), and is repeated here for the sake of completeness. It iteratively constructs a tight lower bound to the objective function by computing the soft label assignments based on the current parameter estimate (Eq. (2)), and subsequently improves the lower bound (and therefore the objective function) using analytical update equations for these parameters. For the Gaussian means and variances these take the standard form
\[
\mu_k \leftarrow \frac{\sum_i w_{ik}\,\tilde{y}_i}{\sum_i w_{ik}},
\qquad
\Sigma_k \leftarrow \frac{\sum_i w_{ik}\,(\tilde{y}_i - \mu_k)(\tilde{y}_i - \mu_k)^{\mathsf{T}}}{\sum_i w_{ik}},
\]
where $w_{ik}$ denotes the soft assignment of voxel $i$ to class $k$ and $\tilde{y}_i$ the bias-field-corrected intensity vector in voxel $i$; the bias field coefficients are updated with an analogous closed-form weighted least-squares solve.
Appendix B Parameter optimization
Here we describe how we perform the parameter optimization in the augmented model of Sec. 3 with the decoder outputs all clamped to a fixed value. In that case, the model can be reformulated in the same form as the original SAMSEG model, so that the same optimization strategy can be used. In particular, lesions can be considered to form an extra class in a SAMSEG model with one additional label, provided that suitably modified mesh vertex label probabilities are used instead of the original ones in the atlas interpolation model of Eq. (1).
The optimization described in Appendix A does require one modification because of the prior binding the means and variances of the WM and lesion classes together. Altered update equations for these parameters, which guarantee that the EM lower bound, and therefore the objective function, is improved in each iteration of the GEM algorithm, can be derived accordingly.
Appendix C Estimating lesion probabilities
We here describe how we approximate
using Monte Carlo sampling. We use a Markov chain Monte Carlo (MCMC) approach to sample triplets
from the distribution : Starting from an initial lesion segmentation obtained from the parameter estimation procedure described in Appendix B, we use a blocked Gibbs sampler in which each variable is updated conditioned on the other ones, where we use the encoder variational approximation obtained during the training of the lesion shape model (see Sec. 3.1.2) to sample from in the next-to-last step, and
in the last step. In these equations, the variables are as defined before, but computed using the sampled voxel assignments. Once the samples are obtained, we approximate the target probability as
In our implementation, samples are collected after discarding the first 50 sweeps of the sampler (the so-called “burn-in” phase). The algorithm repeatedly invokes the decoder and encoder networks of the lesion shape model described in Sec. 3.1.2. Since this shape model was trained in a specific isotropic space, the algorithm requires transitioning between this training space and subject space using an affine transformation. This is accomplished by resampling the input and output of the encoder and decoder, respectively, using linear interpolation.
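The burn-in-and-average step of this Monte Carlo approximation can be sketched as follows; the Gibbs sampler itself is not shown, and the function name and generator interface are our own illustrative choices:

```python
import numpy as np

def mc_lesion_probability(sample_iter, n_samples: int, n_burn_in: int = 50):
    """Monte Carlo estimate of per-voxel lesion probabilities.

    sample_iter yields binary lesion label maps from a blocked Gibbs
    sampler; the first n_burn_in sweeps are discarded, and the next
    n_samples maps are averaged voxel-wise to approximate the posterior
    probability of each voxel belonging to a lesion.
    """
    acc = None
    for i, labels in enumerate(sample_iter):
        if i < n_burn_in:
            continue  # discard burn-in sweeps
        if acc is None:
            acc = np.zeros(labels.shape)
        acc += labels
        if i + 1 == n_burn_in + n_samples:
            break
    return acc / n_samples
```

The resulting probability map is what gets thresholded (with the user-tunable threshold discussed earlier) to produce the final binary lesion segmentation.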