The past twenty years have seen a dramatic increase in research involving medical imaging, thanks in part to the advent of noninvasive techniques such as magnetic resonance imaging (MRI). Neuroimaging in particular has benefited from MRI technology due to the ability to delineate brain structures and pathology with high spatial resolution. MRI technology itself is evolving and improving, with new protocols and higher field strengths that lead to finer resolution images with better contrast. With improvements to MRI technology comes the challenge to identify and study smaller subregions of the brain. Identification of these anatomical ROIs is fundamental to neuroimaging data analysis, as many studies involve analyzing population-level structural variation and changes associated with disease as well as associations between ROI characteristics and clinical outcomes.
The manual delineation of medical images into ROIs by a trained expert is often time-intensive and costly. Thus, even for moderate-sized research studies, obtaining manual segmentations for each image can be infeasible (Iglesias and Sabuncu, 2015). Additionally, these so-called “gold standard” segmentations provided by human experts, and the resulting ROI volumes, are known to exhibit both intra- and inter-rater variability (Yushkevich et al., 2006). In light of these limitations, methods for automatic and semi-automatic segmentation of ROIs are critical for increasing the feasibility of imaging studies that involve imaging data.
A voxel (volumetric pixel) is the single observational unit in a three-dimensional image, analogous to a pixel in a 2D image. Each voxel in an image contains a single grayscale intensity, so that an image can be represented as an array of numeric values. An atlas is an image for which a labeling exists that associates to each voxel a single label indicating the structure or region to which it belongs. The first generation of atlas-guided segmentation techniques generally employed a single atlas to segment a single target or set of target images (Pham et al., 2000). If one could obtain a perfectly segmented atlas and a perfect registration (mapping) of the atlas to a target image, a single-atlas approach would be sufficient. In practice, atlas labels and image registrations are imperfect. The anatomical variability of even healthy brains simply cannot be captured by using a single atlas. As a result, methods for using information from multiple atlases were introduced, including best atlas selection based on pre-specified criteria (Rohlfing et al., 2004)
and mapping images to a single atlas of label probabilities(Ashburner and Friston, 2005). These approaches were predecessors to multi-atlas segmentation (MAS), a state-of-the-art approach to medical image segmentation that has been widely developed and successfully employed in many application areas. Examples of MAS include whole brain parcellation (Wang et al., 2012) and delineating individual cortical/subcortical ROIs such as the cerebellum (Park et al., 2014), amygdala (Hanson, J. L., Suh, J. W., Nacewicz, B. M., Sutterer, M. J., Cayo, A. A., Stodola, D. E., Burghy, C. A., Wang, H., Avants, B. B., and Yushkevich, P. A., et al., 2012), and the ventricles (Raamana et al., 2014). A particularly important application is segmentation and volumetry of the hippocampus, atrophy of which is known to be associated with dementia and Alzheimer’s disease (Yushkevich et al., 2006; Iglesias et al., 2013). Iglesias and Sabuncu (2015) provide a review of MAS methods.
MAS is characterized by combining information from a collection of atlases for segmenting a novel, unlabeled image. The main components of MAS consist of 1) atlas generation, 2) image registration and label propagation, and 3) label fusion. Developments in the literature have largely focused on improving one or more of these three components. Improvements in segmentation accuracy over single-atlas methods are attributable to the wider range of anatomical variation that is represented and leveraged by MAS. Our work here is focused on the label fusion portion of MAS and therefore assumes that multiple candidate segmentations are available in the target image space. For example, these could have been obtained by registering multiple manually segmented atlases to the target image via nonlinear transformation and propagating each atlas’s labels to the target image space.
A simple and popular approach to label fusion is majority voting (Klein et al., 2005; Heckemann et al., 2006), which takes the mode of the label distribution at each voxel after registration. An early extension to simple majority voting is weighted voting, where each atlas’ ‘vote’ is assigned a weight based on its estimated reliability. These weights may be determined by numerous global metrics, including mutual information (Artaechevarria et al., 2008) or iterative re-weighting based on estimated performance of each atlas (Langerak et al., 2010). Local weighting was introduced to account for the fact that the quality of a given registration can vary spatially over the image. Common approaches include weighting by the voxelwise absolute image intensity differences (Isgum et al., 2009) or the Jacobian of the registration transformation (Ramus and Malandain, 2010).
Bayesian approaches to label fusion have also been proposed in the literature. One of the first is that proposed by Sabuncu et al. (2010), which posits a latent membership field to which each voxel belongs. A similar approach is taken by the STAPLE algorithm (Warfield et al., 2004) in which the observed rater segmentations are treated as corrupted versions of an underlying true segmentation. Akhondi-Asl et al. (2014) extended the STAPLE algorithm by associating estimated reliabilities with the membership field.
To date, most Bayesian approaches to label fusion are treated as optimization problems to find the maximum a posteriori (MAP) estimator (i.e., the posterior mode). They find the mode as a point estimate but stop short of quantifying the uncertainty associated with the estimated classification. The posterior mode can sometimes be misleading, though, especially when the posterior distribution is widely dispersed or strongly skewed(Fox and Nicholls, 2001)
. Further, it is often the case that nuisance parameters are estimated and held fixed instead of being integrated out, thus not fully accounting for all sources of uncertainty in the subsequent solution. It is desirable to access the entire posterior distribution to not only estimate a posterior probability map for segmentation, but to have available any summary measures a researcher may want along with appropriate measures of uncertainty.
In this work, we propose a fully Bayesian model for binary label fusion of MR images. We construct a latent variable regression model that allows incorporation of covariate information. We accommodate the variability of image registration quality both within and between rater images by using spatially-varying sensitivity and specificity processes, thereby facilitating smooth, local weighting of each image as an inherent part of the model. We discuss prior elicitation and implementation of the model via Markov chain Monte Carlo in which we use a novel updating scheme for sampling the processes. We compare our proposed approach to majority voting via simulation and application to neuroimaging data from the Alzheimer’s Disease Neuroimaging Initative (ADNI) in which the goal is to segment the hippocampus and thus to accurately estimate its volume.
We motivate and present our proposed approach in Section 2, along with practical considerations such as possible covariate information, prior elicitation, and implementation. In Section 3 we consider an artificial example with corrupted segmentations to compare its performance to that of majority voting. We consider in Section 4 similar comparisons on clinical MRI data obtained from both healthy subjects and those that have been diagnosed with Alzheimer’s disease (AD), and discuss how our approach can be used to estimate the distribution of plausible volumes of a hippocampus. The latter is an important application since the volume of this structure is known to decrease at an accelerated rate in patients with AD relative to a normally aging elderly population. Section 5 concludes with a discussion of the results and thoughts about future research directions.
Suppose we are interested in segmenting a new brain image, hereinafter referred to as the target image. We have available previously manually segmented images to use as atlases after aligning the labeled brains to the target. This alignment is done via image registration
, the process of transforming one image to the space of another via an affine or nonlinear map. The transformation usually involves matching known anatomical “landmarks” or minimizing a global loss function such as mutual information. The registration problem is itself an ongoing area of research in the imaging literature, but state of the art tools include the FLIRT and FNIRT functions in FSL(Jenkinson and Smith, 2001; Jenkinson et al., 2002; Greve and Fischl, 2009; Jenkinson et al., 2012), DRAMMS (Ou et al., 2011), LDDMM (e.g., Zhong et al., 2010), and SyN registration (Tustison and Avants, 2013). In what follows, we assume that a single algorithm is used to register each atlas to the target, but our proposed framework could easily account for different registration algorithms. In either case, the sources of uncertainty leading to corruption of the atlas labels are (i) the different radiologists and their imperfect manual segmentations, (ii) the set of brains that have been manually segmented, none of which is identical to the target brain, and (iii) the registration algorithm itself, which will never produce a perfect alignment between images. Accounting for these sources of uncertainty when estimating the location of the true structures of interest motivates a fully Bayesian approach.
After registering each atlas to the target, we have images indexed by , where is the number of voxels. Corresponding to atlas is a set of labels , where is the number of classes into which the voxels are being assigned. For segmentation of a single anatomical structure, we take . Due to the sources of error already discussed, the atlases will disagree with each other after registration and will each be a corrupted version of the underlying truth. In addition to the observed labels , let , denote the “true” (unobservable) status of the voxel , so that indicates that voxel is a part of the structure of interest, and otherwise.
We associate to each atlas a sensitivity , an ability to correctly detect when a voxel is truly part of the structure of interest (i.e., an ability to avoid false negatives). Similarly, we have also for each atlas a specificity , or an ability to correctly determine when a voxel is truly outside of the ROI (i.e., an ability to avoid a false positive). Since the quality of any registration can vary throughout an image, the “best” atlas to use actually depends on location. In other words, just because there is poor agreement between the atlas and a target image in one part of the brain, that does not necessarily mean that the atlas is unreliable in other areas (Artaechevarria et al., 2009). Thus, we allow the sensitivity and specificity of each atlas to be a function not only of the specific atlas, but also the voxel. For a single target image, we suppose that and , for atlases and voxels . Given the true voxel statuses and the sensitivity and specificity of each atlas (i.e., the reliability), we assume that observed labels are conditionally independent; i.e.,
Sometimes auxiliary information will be available to inform us about the reliability of a particular atlas at a given location in the image. For instance, when the registered image and the target image are both observed using the same image modality (e.g., T1-weighted MRI), we expect the matched images to have similar image intensities throughout after suitable normalization, meaning that we can use intensity differences as covariate information in modeling the sensitivity and specificity (Isgum et al., 2009). It is important to impose smoothness on the spatially-varying reliabilities to mitigate the effect of image noise (Artaechevarria et al., 2009). To accommodate possible covariate information and to impose spatial smoothness, we model the sensitivities and specificities as
where is the standard Gaussian CDF, and
are known vectors of covariates with coefficientsand , respectively, and and are elements of mean-zero Gaussian Markov random fields (GMRFs; Rue and Held, 2005). Our experience is that it is often sufficient to simply set to let the spatial processes detect the trends a posteriori.
Here, is a neighborhood matrix such that , where if and only if voxels and share an edge or a corner, is the indicator function, and
. Defining neighbors in this way leads to a nonstationary process due to the edge effects; i.e., the voxels on the edges have different numbers of neighbors than the interior voxels, leading to different marginal variances(Besag and Kooperberg, 1995). Solutions to this problem suggested by Besag and Kooperberg (1995)
include ‘zero padding’ the image so that the edge pixels have complete sets of neighbors with the missing neighbors taken to be zero, imputing the missing neighbor values with a typical value such as the median, or retaining the given neighborhood structure but carefully choosing thevalues in (4) to create a process with constant variance throughout the image (Dempster, 1972). Besag and Kooperberg (1995) observed, however, that when regular arrays are of large dimension and the regions of interest are in the interior of the image, edge effects are of secondary interest and have negligible effect on inferences, so one can simply ignore them.
From (2), we see that when for all , does not appear in the likelihood determined by (1) and hence is not updated in the posterior. Since with positive probability, the prior distribution on must be proper to avoid an improper posterior (Brown et al., 2017), and likewise for . Thus, we include the “propriety parameters” and in (4) to force the precision matrices to be nonsingular, as suggested by Banerjee et al. (2015). A sufficient condition is , in which case is strictly diagonally dominant and symmetric, and hence positive definite.
It is likely that we will have available auxiliary information concerning the true location of the structure of interest. This information can be in the form of prior anatomical information (e.g., Ashburner and Friston, 2005) or in the level of agreement among the atlases themselves. Toward this end, we suppose that,
where is a vector of covariates pertaining to voxel , is the corresponding vector of regression coefficients, and is a one-to-one and differentiable link function as used in generalized linear models (GLMs; McCullagh and Nelder, 1989). We assume any dependence between inclusion indicators is completely explained by the covariate information; i.e., .
). We take conventional Gaussian and Gamma priors for these parameters, since they are sufficiently flexible with appropriate specification of the corresponding hyperparameters. Thus, we assume
The resulting posterior density is given in Section 7 of the Supplementary Material.
2.2 Prior Specification
To specify the hyperparameters corresponding to the parameters appearing in the mean functions of and , we recognize that , so that the effect of the mean function on the likely values of the reliability parameters is almost certainly between and . For instance, considering the sensitivity, a priori we suppose that for any and any , for all . With a pure spatial process, this requirement becomes . To use this condition to induce a prior on the spatial effect , we use the fact that , where (Bernardinelli et al., 1995; Eberly and Carlin, 2000). With eight neighbors at a voxel , the desired constraint can be solved to yield , but we can conservatively take the hyperparameters in the Gamma prior on to mildly concentrate the probability mass near one. The argument is the same for specifying the hyperparameters on and .
Similar to specifying the priors on the regression coefficients determining the reliability parameters, we can take in (6), for some appropriate and
the identity matrix, to allow a reasonably flat prior on thescale. Alternatively, with prior knowledge concerning the underlying structure of interest, it is possible to induce an informative prior for through a so-called conditional mean prior (Bedrick et al., 1996). Under this approach, we partition the -dimensional covariate space into regions and choose covariate values that are representative of each of the regions so that is nonsingular. A prior is put directly on the mean response at these covariate values, , where applies
componentwise. While the regression coefficients themselves may be difficult to interpret, we can meaningfully assign some probability distribution to
using a Beta distribution with shape parameters chosen to reflect the prior knowledge at the covariate values.Bedrick et al. (1996) show that if , then the induced prior is , where denotes the first derivative of .
2.3 Implementation via MCMC
For posterior inference, we use Markov chain Monte Carlo sampling (MCMC; Gelfand and Smith, 1990), specifically a Metropolis-Hastings-within-Gibbs scheme (Metropolis et al., 1953; Hastings, 1970; Geman and Geman, 1984; Müller, 1991). The necessary full conditional distributions are given in Section 7 of the Supplementary Material.
Conditional on the latent statuses , sampling can be done using the ‘data augmentation trick’ for Bayesian binary regression (e.g., Albert and Chib, 1993). A more direct approach is the algorithm of Gamerman (1997), inspired by iteratively reweighted least squares fitting of GLMs. Under our proposed model (1) - (6), the full conditional density of is , where is the design matrix constructed from , and . Following McCullagh and Nelder (1989), define and with and for . Then, given the current iterate , the proposal distribution is where and . If the prior on is elicited through the CMP approach of Bedrick et al. (1996), this proposal mechanism can still be used after suitable modification. See Section 8 of the Supplementary Material for a derivation under the logistic link.
Depending on , every voxel contributes information to either or , but not both, due to the definitions of and . Let . Then, following Albert and Chib (1993), we introduce latent variables , and define , and similarly . Letting , this can be expressed as
Using the augmented data likelihood, the full conditional distributions of the spatial effects are given by
For the conditional distribution of , it can be shown that, for all ,
denotes a Gaussian distribution with meanand variance truncated to the interval .
To draw and in block updates, we need to draw from the multivariate Gaussians given by (8) and (9). This can be done by finding Cholesky factorizations of their respective covariance matrices. However, these matrices depend on parameters that change on each iteration of the MCMC routine, meaning that the factorizations have to be calculated repeatedly. Such factorizations are expensive operations in large dimensions. Alternatively, we can exploit the Markov structure of CAR models to parallelize single-site updates of and , thus striking a balance between computational and statistical efficiency.
A convenient feature of CAR models is the availability of easily interpretable conditional distributions that follow from the Markov property. For instance, the CAR prior on implies that , where , , and . Combined with the model for the augmented data , this yields ,
where and .
The Markov property still holds since, conditional on its neighbors, is independent of the rest of the field. Thus, we can update independent elements of simultaneously (e.g., in parallel or through ‘vectorized’ calculations in
R (R Development Core Team, 2008)) while conditioning on all of their common neighbors. Finding the independent sets is equivalent to finding a coloring of the graph corresponding to the GMRF, in which the edges correspond to conditional dependence (Rue and Held, 2005). For a vector , define . Then a coloring produces a partition of the voxels, indexed as , and corresponding cuts , so that This type of Gibbs updating is called chromatic sampling (Gonzalez et al., 2011; Brown and McMahan, 2017) due to its association with graph coloring. With simultaneous updates, chromatic Gibbs sampling dramatically reduces the cost of sequential single-site updating while avoiding the computational burden of drawing from high-dimensional Gaussian distributions. We use chromatic Gibbs sampling on both and to accelerate the MCMC routines.
3 Numerical Studies
To illustrate our proposed approach on an artificial example, we consider the problem of locating an unobservable structure on a two-dimensional image of size . The true structure is displayed in Figure 1. Suppose we have available manually segmented images, resulting in four atlases to use in our proposed model. To simulate imperfect labeling of the original images as well as inconsistencies with the registration, we corrupt the truth through translation, dilation, contraction, and rotation of the true underlying structure. The four corrupted segmentations also are displayed in Figure 1.
To fit the Bayesian label fusion model, we assume a zero mean CAR model in (3) for the rater-specific sensitivities and specificities, so that each is modeled as a probit-transformed pure spatial process. We take the neighbors of a pixel to be the eight surrounding pixels that share either an edge or a corner. As discussed in Section 2, we ignore any edge effects due to the dimension of the image and the likely location of the structure of interest.
To create covariate information for the inclusion indicators , we create signed distance label transforms for each atlas (Iglesias et al., 2012). A signed distance label transform for a binary image assigns to each pixel a number corresponding to its distance from the nearest edge in an image, where an edge is indicated by a zero adjacent to a one. The sign of the distance corresponds to whether or not the pixel is inside or outside of an identified structure in the binary image; e.g., a pixel has negative distance if it is inside the structure, positive distance if it is outside, and zero if it is on the boundary. Signed distance label maps can be created manually, or quickly and easily using the
bwdist function in
MATLAB (The MathWorks, Natick, MA, 2013). In our case, we call
R using the
R.matlab package (Bengtsson, 2016), create the signed distance label maps for each atlas, and read them back into
R. To summarize the information provided by each of the manual segmentations, we create one total signed distance map by summing over the atlas-specific labels at each pixel. The summed signed distance label map is displayed in Figure 2. The distance label at each pixel is used as the covariate for the pixel-level regression model on , given in (5).
To complete the model specification, we set the shape parameters in the Gamma priors in (6) to , to mildly concentrate the precisions and about 1. As discussed in Subsection 2.2, we elicit the prior on the regression coefficients via the conditional mean approach; i.e., , , and . These specifications roughly approximate the addition of 20 new observations at the point at which the signed distance map is zero. A single MCMC chain is run using 100,000 iterations with the first 50,000 discarded as a burn-in period, with every 50th iterate being retained thereafter resulting in a Monte Carlo sample size of 1,000 from the approximate posterior distribution. The simulations are coded entirely in
R. Convergence of the chains is assessed by examining the trace plots of selected parameters.
Figure 3 displays the estimated posterior probability maps of
, along with the approximate standard errors obtained from the MCMC output. We see strong agreement between the areas assigned a high probability of inclusion and the truth. As expected, the probabilities are greatest in the areas where the atlases agree, and drop near the edges where the atlases disagree about inclusion. The disagreement among the images is also reflected in the uncertainties associated with these probabilities, which are available by examining the entire posterior distribution. Of course, such uncertainty quantification is not possible when estimating the inclusion through MAP estimation alone. The estimated marginal posterior density of the regression coefficient on the signed distance labels is shown in Figure3, where we see the negative association between the distance assigned to a pixel and its probability of being truly part of the structure, as expected.
Also displayed in Figure 3 is the estimated structure obtained by thresholding the posterior probabilities at , so that we retain the pixels that have an inclusion probability ‘significantly’ greater than 0.5. The recovery of the truth is substantial, considering the strongly corrupted images used as references (Figure 1). Using such images in the popular majority voting scheme results in poorer quality reconstruction of the true structure. Since the translation is completely enveloped by the dilation, majority voting yields a translated image very similar to the upper left corner of Figure 1. To quantify the image similarities, we calculate Dice coefficients (Dice, 1945). Letting denote the indicator of inclusion of pixel in the estimated image (a positive result), the Dice coefficient is calculated as , with values close to one indicating strong agreement between two images, and values close to zero indicating strong disagreement. The Dice coefficients for both our proposed model and majority voting are displayed in Table 1. Using our proposed approach, we see a substantial improvement in similarity between the estimated structure and the truth compared to majority voting.
|Bayesian Regression ()||Bayesian Regression ()||Majority Voting|
One of the reasons for the dramatic improvement in our proposed approach over majority voting is our approach estimates the sensitivity and specificity of each atlas as part of the model. This means that when estimating the probability of inclusion at each pixel, the input provided by each atlas is weighted by its estimated reliability on each iteration of the MCMC algorithm. Further, the reliability of each atlas is allowed to vary smoothly over the image. This is an important feature since the quality of an image registration is not constant over an entire image space, but is better in some parts of an image than others; the smoothness imposed by the assumed spatial process protects the sensitivity and specificity estimates from image noise that is inherent in any MRI. The a posteriori mean sensitivity and specificity fields for each atlas are displayed in Figure 4. We see, for instance, the high sensitivity of the translated image where it overlaps with the truth, and the sharp drop in its sensitivity in the area where it disagrees due to its upward shift. Similarly, we can see in its specificity field the area in which it is giving false positives, again due to the upward shift. Similar interpretations hold for the remaining atlases. Thus each atlas is appropriately weighted when attempting to identify where the true structure lies. This weighting is omitted entirely in the simple majority voting procedure. In addition to obtaining an estimate of the true structure and the associated uncertainty, our procedure facilitates evaluating each individual atlas through the sensitivity and specificity fields. This additional information could potentially be useful for evaluating a particular registration algorithm or a particular manual segmentation that might be used repeatedly when performing label fusion on different target images.
To illustrate the importance of allowing the sensitivity and specificity to vary throughout the image, consider an approach in which we assume a model similar to that proposed, but the reliability parameters of each atlas are taken to be constant throughout an image. Instead of the spatially varying process assumed in (3), we suppose that for each atlas , and . Believing that each atlas should have a high sensitivity to where the true structure lies, we concentrate the prior mass of close to one with and . On the other hand, we take a uniform prior on () to allow for all values to be equally probable a priori on that scale. We run the MCMC similar to before and retain the last half of the chain as an approximate sample from the posterior distribution.
Supplementary Figure 9 displays the approximate posterior inclusion probability map along with the standard error. A poor segmentation is produced by this approach. By assuming that the reliability parameters are constant throughout an image, the model is implying that any particular atlas is either reliable everywhere, or it is reliable nowhere. The result is essentially best atlas selection for choosing the segmentation. In this case, the selected atlas is the dilated one. The reason for this is clear in Supplementary Figure 10, which displays the approximate marginal posterior distributions of the reliability parameters for each atlas. Each atlas is estimated to have a very high specificity so that they are all very good at detecting where the structure is not. However, the dilated atlas has the highest estimated sensitivity. Thus, the algorithm chooses the dilation as the most reliable of the four atlases, resulting in the segmentation that closely matches it. It is interesting to note that the strong Bayesian learning that occurred about these parameters belies the fact that we have made poor modeling assumptions here.
These simulation results demonstrate dramatic improvements over simple majority voting that are possible through a Bayesian spatial regression model. This improvement is possible even when strongly corrupted atlases are used as inputs to the label fusion. These results also demonstrate the importance of allowing the estimated reliability of each atlas to vary over the image. No single atlas is uniformly more reliable than another throughout the image, so it is important to weight them differently in different areas. This is ultimately the flaw with best atlas selection.
4 Application to Hippocampus Segmentation
The Alzheimer’s Disease Neuroimaging Initiative (ADNI; http://www.adni.loni.usc.edu
) is an ongoing, multi-center, longitudinal study with goals of better understanding of progression from normal aging through various states of cognitive decline to Alzheimer’s disease (AD) and to develop effective biomarkers for disease diagnosis and monitoring. Currently, there is no cure for AD. Therefore, it is of critical scientific importance to gain a better understanding of early disease mechanisms in the brain to inspire the development of effective, targeted therapies for the disease.
The association between hippocampal atrophy and AD is well-documented (Likeman et al., 2005; Leung et al., 2010) and volumetric changes over time can be used to aid the diagnosis of early AD and track progression (Dubois et al., 2007). As such, twelve different manual segmentations of the left and right hippocampus were obtained for each subject from the Harmonized Protocol For Hippocampal Volumetry (http://www.hippocampal-protocol.net/SOPs/labels.php#final; Boccardi et al., 2015). The corresponding T1 images were downloaded from the LONI IDA website (https://ida.loni.usc.edu/login.jsp). For each image, we applied N4 inhomogeneity correction to reduce low-frequency scanner artifacts (Tustison et al., 2010). We applied Multi-atlas Skull Stripping to perform brain extraction (Doshi et al., 2013). For each target T1 image, the remaining 11 T1 atlases were non-linearly registered to the target image using
deformable registration with mutual information cost and Welch windowed sinc interpolation(Avants et al., 2011) in
R. The concomitant transformations were applied to each manual segmentation with nearest-neighbor interpolation to obtain the atlases for each target. A single axial slice was chosen for each target image by visual inspection that approximately maximized the hippocampal volume for that subject across slices. The atlas-guided candidate segmentations for a single target image slice are displayed in Figure 5. The model specifications and MCMC implementation details are identical to those used in Section 3. We assess the performance of our proposed method in a ‘leave-one-out’ approach by separately considering each subject as a target image to be segmented and the remaining subjects as atlases with manual segmentations, summarizing the results over all twelve replications.
Figure 6 displays the approximate posterior probability map for hippocampus inclusion, along with the ‘true’ (manual) segmentation in a single slice for one of the target images. We see sharply delineated areas of high probability, so that after thresholding at 0.5 (also shown in the Figure) we obtain an image similar to the probabilities themselves. A difference between the estimated label fusions and the manual segmentation is in the smoothness. The weights assigned to each atlas via the sensitivity and specificity spatial processes have propagated through to the estimated marginal posterior inclusion probabilities so that boundaries between the background tissue and the hippocampus have been smoothed out compared to the manual segmentation. The lower right of Figure 6
displays the estimated pointwise standard deviations, where we can see the highest variability in the probability estimates is near the edges, nearly tracing out a level set around the estimated hippocampus. This agrees with intuition concerning where the most uncertainty would be in the estimates.
There is obvious disagreement between the posterior structural estimate and the manual segmentation, but this is inevitable with any label fusion procedure. Indeed, for the target image displayed in Figure 6, the thresholded posterior probability map attains a Dice coefficient of 0.774. On the other hand, the target-estimate agreement attained by majority voting has a Dice coefficient of 0.737. Table 2 summarizes the Dice coefficients after segmenting each of the 11 target images using both our proposed Bayesian spatial regression (BSR) model and majority voting. In each case, our proposed approach performs comparable to or better than majority voting. An interesting phenomenon here is that our approach does considerably better when applied to the healthy brain images and appears to do worse on the diseased brains. As suggested by Figure 6 and the other posterior segmentations (not shown), our procedure smooths out the boundaries between the segmented structure and the background. Such smoothing at the boundaries, a well-known characteristic of GMRFs (Smith and Fahrmeir, 2007), results in over-segmenting the hippocampus area in each diseased brain when combined with the information provided by registering a healthy brain to a diseased brain. On the other hand, this same feature yields improved estimates of the hippocampus location in healthy brains, since the reliability parameters downweight the atlases that are more inconsistent, protecting against the influence of poor segmentations.
|Healthy||0.799 (0.018)||0.772 (0.025)|
|Diseased||0.731 (0.019)||0.754 (0.020)|
|Overall||0.765 (0.162)||0.763 (0.157)|
An appealing feature of our proposed model is the ability to estimate the spatially-varying reliability (sensitivity and specificity) parameters for each atlas used in segmenting a target image. In the images displayed in Figure 7, the reliability of each atlas throughout an image is clearly visible, allowing one to gauge the overall quality of each individual segmentation. The estimated sensitivities and specificities agree with what we would expect for each atlas based on its difference with the manual segmentation. In particular, we see the smooth decay of sensitivity between regions of high agreement and low agreement with the manual segmentations. Since the sensitivity is by definition conditional on the voxel truly being part of the structure of interest (), the model is only capable of estimating sensitivity in areas where it estimates a high probability of the voxel truly being part of the structure. This is evident in the sensitivity maps, where as we move away from the structure of interest, there are no nearby voxels thought to be included in the structure and thus no information with which to update the spatial fields. Thus the fields return to the prior mean. Similarly, the specificity values are estimated to be very high away from the targeted structure, where all the atlases agree on the voxels belonging to the background. The specificity fields are only strongly determined by their prior mean in areas where is quite low.
In practice, the purpose of segmenting an anatomical structure in the brain is to obtain important information such as its volume or average image intensity within the structure. In our case, segmenting the hippocampus is a step toward estimating its volume. If we simply used majority voting or a similar label fusion procedure to obtain a binary map, then the only way to estimate the volume is by summing the indicators. Doing so ignores uncertainty in the indicators caused by image pre-processing, registration error, and biological variation. Another important source of uncertainty arises from so-called “partial volume effects” in which a particular voxel may not be entirely inside or outside a structure of interest, but partly in both. Our proposed approach addresses this by providing probabilities of inclusion, which can be interpreted as summarizing partial volume effects (Ashburner and Friston, 2005), as well as facilitating estimation of a distribution of plausible volumes through the posterior. For instance, Figure 8 displays the approximate marginal posterior distributions of two possible estimators of volume (in ) of the hippocampus in a single slice of each image. One estimator is , similar to the estimator that one is forced to use with, e.g., majority voting. Alternatively, we try to account for partial volume effects with , where is our estimate of the conditional Bernoulli probability derived in Supplementary Section 7. The latter estimator is similar to a posteriori volume distributions estimated by Iglesias et al. (2013). While the two estimators generally agree on the estimated volumes, there is clearly much more uncertainty associated with the partial volume estimator as opposed to the estimator based on summing indicators. Appropriate volume estimators in structural brain images remains an open question that we leave to future work.
Application of our proposed Bayesian spatial binary regression model to label fusion in the ADNI data demonstrates the feasibility of our approach as well as additional useful information that would be unavailable under most fusion procedures. Posterior probability maps can be used to summarize where the hippocampus is likely to be. A simple thresholding rule yields image similarities to manual segmentation that is competitive with or better than majority voting. Such thresholding may actually be unnecessary, however, as the goal is to estimate the volume of the hippocampus, not to simply classify each voxel into one of two categories. Numerous volume estimators become available with our model, as well as approximations to their marginal posterior distributions via MCMC. We suggest two such estimators here, but others are certainly possible and reasonable.
Anatomical segmentation of brain images is a challenging task that becomes quite resource-intensive when done manually. Nevertheless, such segmentation is necessary for neuroscientists and neurologists to accurately measure traits of both diseased and healthy brains so that they can be related to clinical endpoints. Thus there is a need for the continued development of automated and semi-automated procedures for anatomical segmentation.
In this work we have taken an hierarchical Bayesian approach to incorporating available reference atlases into a label fusion problem for anatomical segmentation. We proposed a general spatial binary regression model that not only uses each atlas to determine the likely location of the target structure, but estimates the reliability of each individual atlas via spatially-varying sensitivity and specificity fields. These reliability parameters are built in to the Bayesian model so that each atlas’ ‘vote’ is automatically weighted when estimating the inclusion probability at each voxel. In contrast to most existing procedures for label fusion, we are able to fully account for all sources of uncertainty by accessing the posterior distribution via MCMC. We discussed chromatic Gibbs sampling for more efficiently updating the Gaussian Markov random fields in the model as well as effective Metropolis-Hastings tailored for Bayesian binary regression. We elicit informative prior distributions about regression parameters via conditional mean priors. With posterior probability maps and their associated distributions, a researcher can use a reasonable thresholding rule to construct a binary segementation. The proposed approach also facilitates volume estimation directly, including estimation of a distribution of plausible volumes. We showed that a simple thresholding rule leads to a binary segmentation as good as or better than simple majority voting. We suggest also two possibilities for estimating distributions of the volume of the hippocampus, a critical problem when studying progression of Alzheimer’s disease.
To date, most label fusion approaches have been designed for intra-modal problems in which the target image and the candidate images are all in the same imaging modality (e.g., T1 is registered to T1, etc.). While not demonstrated in this work, the binary regression model proposed here easily allows incorporating information from different imaging modalities as covariate information, thus facilitating multimodal label fusion, a vastly underexplored area. Additional preprocessing steps, and their inconsistency between research groups, can also be accounted for to estimate their effects on subsequent inferences. Using spatial binary regression and related models for multimodal label fusion and exploration of preprocessing effects are possibly interesting avenues for future research.
Other important areas for future work include further exploration of efficient Bayesian computation for estimating the posterior distribution, a persistent problem for applied Bayesians working with large-scale and high-dimensional data. In addition to algorithmic advances, there is a need for exploring practical computational solutions such as distributing Monte Carlo effort across graphics processing units(Beam et al., 2016; Terenin et al., 2017). We considered segmentation of two-dimensional slices from multiple brain images. A reasonable and computationally feasible approximation to full three-dimensional MCMC analysis would be to simply run multiple slices in parallel while only assuming two-dimensional Markov structure within each slice. A complete three-dimensional spatial model would dramatically increase the computational load and make necessary the use of recent advances in numerical linear algebra (e.g., Golub and Van Loan, 1996) to be feasible. Such advancements also would be very helpful, or even necessary, for extending the binary regression model to multi-class segmentation. Indeed, many imaging studies begin not with simply a binary segmentation of a single structure, but a full parcellation of the brain into multiple structures. Posterior estimation of multi-class regression models is known to be much more computationally expensive than its binary counterpart. Nevertheless, we believe existing approaches and further advancement can bring fully Bayesian brain parcellation with uncertainty quantification well into the realm of possibility.
This material is based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. DAB is partially supported by NSF Grant CMMI-1563435. CSM is partially supported by National Institutes of Health grant AI121351. RTS is partially supported by R01NS085211, R21NS093349, R01MH112847, R01NS060910, and R01EB017255. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies. The authors thank the clinical neuroimaging research group at Johns Hopkins University, particularly John Muschelli for his help with the image registrations.
- Akhondi-Asl et al. (2014) Akhondi-Asl, A., Hoyte, L., Lockhart, M., and Warfield, S. (2014). A logarithmic opinion pool based STAPLE algorithm for the fusion of segmentation with associated reliability weights. IEEE Transactions on Medical Imaging, 33(10):1997–2009.
- Albert and Chib (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous data. Journal of the American Statistical Association, 88(422):669–679.
- Artaechevarria et al. (2008) Artaechevarria, X., Muñoz Barrutia, A., and Ortiz-de Solorzano, C. (2008). Efficient classifier generation and weighted voting for atlas-based segmentation: Two small steps faster and closer to the combination oracle. In Medical Imaging, page 69141W. International Society for Optics and Photonics.
- Artaechevarria et al. (2009) Artaechevarria, X., Muñoz Barrutia, A., and Ortiz-de Solorzano, C. (2009). Combination strategies in multi-atlas image segmentation: Application to brain MR data. IEEE Transactions on Medical Imaging, 28(8):1266–1277.
- Ashburner and Friston (2005) Ashburner, J. and Friston, K. J. (2005). Unified segmentation. NeuroImage, 26(3):839–851.
- Avants et al. (2011) Avants, B. B., Tustison, N. J., Song, G., Cook, P. A., Klein, A., and Gee, J. C. (2011). A reproducible evaluation of ANTs similarity metric performance in brain image registration. NeuroImage, 54(3):2033–2044.
- Banerjee et al. (2015) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2015). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC, Boca Raton, 2nd edition.
- Beam et al. (2016) Beam, A. L., Ghosh, S. K., and Doyle, J. (2016). Fast Hamiltonian Monte Carlo using GPU computing. Journal of Computational and Graphical Statistics, 25(2):536–548.
- Bedrick et al. (1996) Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460.
- Bengtsson (2016) Bengtsson, H. (2016). R.matlab: Read and Write MAT Files and Call MATLAB from Within R. R package version 3.5.1.
- Bernardinelli et al. (1995) Bernardinelli, L., Clayton, D. G., and Montomoli, C. (1995). Bayesian estimates of disease maps: How important are priors? Statistics in Medicine, 14:2411–2431.
- Besag (1974) Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B, 36:192–236.
- Besag and Kooperberg (1995) Besag, J. and Kooperberg, C. (1995). On conditional and intrinsic autoregressions. Biometrika, 82(4):733–746.
- Boccardi et al. (2015) Boccardi, M., Bocchetta, M., Morency, F. C., Collins, D. L., Nishikawa, M., Ganzola, R., Grothe, M. J., Wolf, D., Redolfi, A., Pievani, M., et al. (2015). Training labels for hippocampal segmentation based on the EADC-ADNI harmonized hippocampal protocol. Alzheimer’s and Dementia, 11(2):175–183.
- Brown et al. (2017) Brown, D. A., Datta, G. S., and Lazar, N. A. (2017). A Bayesian generalized CAR model for correlated signal detection. Statistica Sinica, 27:1125–1153.
- Brown and McMahan (2017) Brown, D. A. and McMahan, C. S. (2017). Sampling Strategies for Fast Updating of Gaussian Markov Random Fields. ArXiv:1702.05518.
- Dempster (1972) Dempster, A. P. (1972). Covariance selection. Biometrics, 28:157–175.
- Dice (1945) Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26:297–302.
- Doshi et al. (2013) Doshi, J., Erus, G., Ou, Y., Gaonkar, B., and Davatzikos, C. (2013). Multi-atlas skull-stripping. Academic Radiology, 20(12):1566–1576.
- Dubois et al. (2007) Dubois, B., Feldman, H. H., Jacova, C., DeKosky, S. T., Barberger-Gateau, P., Cummings, J., Delacourte, A., Galasko, D., Gauthier, S., Jicha, G., et al. (2007). Research criteria for the diagnosis of Alzheimer’s disease: Revising the NINCDS–ADRDA criteria. The Lancet Neurology, 6(8):734–746.
- Eberly and Carlin (2000) Eberly, L. E. and Carlin, B. P. (2000). Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Statistics in Medicine, 19:2279–2294.
- Fox and Nicholls (2001) Fox, C. and Nicholls, G. K. (2001). Exact MAP states and expectations from perfect sampling: Greig, Porteous and Seheult revisited. In Mohammed-Djafari, A., editor, AIP Conference Proceedings, volume 568, pages 252–263. AIP.
Gamerman, D. (1997).
Sampling from the posterior distribution in generalized linear mixed models.Statistics and Computing, 7:57–68.
- Gelfand and Smith (1990) Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409.
- Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.
- Golub and Van Loan (1996) Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. The Johns Hopkins University Press, Baltimore, 3rd edition.
- Gonzalez et al. (2011) Gonzalez, J. E., Low, Y., Gretton, A., and Guestrin, C. (2011). Parallel Gibbs sampling: From colored fields to thin junction trees. In , pages 324–332.
- Greve and Fischl (2009) Greve, D. N. and Fischl, B. (2009). Accurate and robust brain image alignment using boundary-based registration. NeuroImage, 48(1):63–72.
- Hanson, J. L., Suh, J. W., Nacewicz, B. M., Sutterer, M. J., Cayo, A. A., Stodola, D. E., Burghy, C. A., Wang, H., Avants, B. B., and Yushkevich, P. A., et al. (2012) Hanson, J. L., Suh, J. W., Nacewicz, B. M., Sutterer, M. J., Cayo, A. A., Stodola, D. E., Burghy, C. A., Wang, H., Avants, B. B., and Yushkevich, P. A., et al. (2012). Robust automated amygdala segmentation via multi-atlas diffeomorphic registration. Frontiers in Neuroscience, 6:1–8.
- Hastings (1970) Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their application. Biometrika, 57:97–109.
- Heckemann et al. (2006) Heckemann, R. A., Hajnal, J. V., Aljabar, P., Rueckert, D., and Hammers, A. (2006). Automatic anatomical brain MRI segmentation combining label propagation and decision fusion. NeuroImage, 33(1):115–126.
- Iglesias and Sabuncu (2015) Iglesias, J. E. and Sabuncu, M. R. (2015). Multi-atlas segmentation of biomedical images: A survey. Medical Image Analysis, 24(1):205–219.
- Iglesias et al. (2012) Iglesias, J. E., Sabuncu, M. R., and Leemput, K. V. (2012). A generative model for probabilistic label fusion of multimodal data. In Multimodal Brain Image Analysis, pages 115–133.
- Iglesias et al. (2013) Iglesias, J. E., Sabuncu, M. R., and Leemput, K. V. (2013). Improved inference in Bayesian segmentation using Monte Carlo sampling: Application to hippocampal subfield volumetry. Medical Image Analysis, 17:766–778.
- Isgum et al. (2009) Isgum, I., Staring, M., Rutten, A., Prokop, M., Viergever, M. A., and van Ginneken, B. (2009). Multi-atlas-based segmentation with local decision fusion: Application to cardiac and aortic segmentation in CT scans. IEEE Transactions in Medical Imaging, 28(7):1000–1010.
- Jenkinson et al. (2002) Jenkinson, M., Bannister, P., Brady, J. M., and Smith, S. M. (2002). Improved optimisation for the robust and accurate linear registration and motion correction of brain images. NeuroImage, 17(2):825–841.
- Jenkinson et al. (2012) Jenkinson, M., Beckmann, C. F., Behrens, T. E., Woolrich, M. W., and Smith, S. M. (2012). FSL. NeuroImage, 62:782–790.
- Jenkinson and Smith (2001) Jenkinson, M. and Smith, S. M. (2001). A global optimisation method for robust affine registration of brain images. Medical Image Analysis, 5(2):143–156.
- Klein et al. (2005) Klein, A., Mensh, B., Ghosh, S., Tourville, J., and Hirsch, J. (2005). Mindboggle: Automated brain labeling with multiple atlases. BMC Medical Imaging, 5(1):7.
- Langerak et al. (2010) Langerak, T. R., van der Heide, U. A., Kotte, A. N. T. J., Viergever, M. A., van Vulpen, M., and Pluim, J. P. W. (2010). Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (SIMPLE). IEEE Transactions on Medical Imaging, 29(12):2000–2008.
- Leung et al. (2010) Leung, K. K., Barnes, J., Ridgway, G. R., Bartlett, J. W., Clarkson, M. J., Macdonald, K., Schuff, N., Fox, N. C., Ourselin, S., et al. (2010). Automated cross-sectional and longitudinal hippocampal volume measurement in mild cognitive impairment and Alzheimer’s disease. NeuroImage, 51(4):1345–1359.
- Likeman et al. (2005) Likeman, M., Anderson, V. M., Stevens, J. M., Waldman, A. D., Godbolt, A. K., Frost, C., Rossor, M. N., and Fox, N. C. (2005). Visual assessment of atrophy on magnetic resonance imaging in the diagnosis of pathologically confirmed young-onset dementias. Archives of Neurology, 62(9):1410–1415.
- McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, London, second edition.
- Metropolis et al. (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1091.
- Müller (1991) Müller, P. (1991). A generic approach to posterior integration and Gibbs sampling. Technical Report, Purdue University.
- Ou et al. (2011) Ou, Y., Sotiras, A., Paragios, N., and Davatzikos, C. (2011). DRAMMS: Deformable registration vis attribute matching and mutual-saliency weighting. Medical Image Analysis, 15(4):622–639.
- Park et al. (2014) Park, M. T. M., Pipitone, J., Bear, J. H., Winterburn, J. L., Shah, Y., Chavez, S., Schira, M. M., Lobaugh, N. J., Lerch, J. P., Voineskos, A. N., et al. (2014). Derivation of high-resolution MRI atlases of the human cerebellum at 3T and segmentation using multiple automatically generated templates. NeuroImage, 95:217–231.
- Pham et al. (2000) Pham, D. L., C., X., and Prince, J. L. (2000). Current methods in medical image segmentation 1. Annual Review of Biomedical Engineering, 2(1):315–337.
- R Development Core Team (2008) R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
- Raamana et al. (2014) Raamana, P. R., Rosen, H., Miller, B., Weiner, M. W., Wang, L., and Beg, M. F. (2014). Three-class differential diagnosis among Alzheimer’s disease, frontotemporal dementia, and controls. Frontiers in Neurology, 5.
- Ramus and Malandain (2010) Ramus, L. and Malandain, G. (2010). Assessing selection methods in the context of multi-atlas based segmentation. In Proceedings of the IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 1321–1324.
- Rohlfing et al. (2004) Rohlfing, T., Brandt, R., Menzel, R., and Maurer Jr., C. R. (2004). Evaluation of atlas selection strategies for atlas-based segmentation with application to confocal microscopy of bee brains. NeuroImage, 21(4):1428–1442.
- Rue and Held (2005) Rue, H. and Held, L. (2005). Gaussian Markov Random Fields. Chapman & Hall/CRC, Boca Raton.
- Sabuncu et al. (2010) Sabuncu, M. R., Yeo, B. T., van Leemput, K., Fischl, B., and Golland, P. (2010). A generative model for image segmentation based on label fusion. IEEE Transactions on Medical Imaging, 29(10):1714–1729.
- Smith and Fahrmeir (2007) Smith, M. and Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to functional magnetic resonance imaging. Journal of the American Statistical Association, 102:417–431.
- Terenin et al. (2017) Terenin, A., Dong, S., and Draper, D. (2017). GPU-accelerated Gibbs sampling. ArXiv:1608.04329v3.
- Tustison and Avants (2013) Tustison, N. J. and Avants, B. B. (2013). Explicit B-spline regularization in diffeomorphic image registrations. Frontiers in Neuroinformatics, 7.
- Tustison et al. (2010) Tustison, N. J., Avants, B. B., Cook, P. A., Zheng, Y., Egan, A., Yushkevich, P. A., and Gee, J. C. (2010). N4ITK: Improved N3 bias correction. IEEE Transactions on Medical Imaging, 29(6):1310–1320.
- Wang et al. (2012) Wang, Y., Jia, H., Yap, P.-T., Cheng, B., Wee, C.-Y., Guo, L., and Shen, D. (2012). Groupwise segmentation improves neuroimaging classification accuracy. In Multimodal Brain Image Analysis, pages 185–193. Springer.
- Warfield et al. (2004) Warfield, S. K., Zou, K. H., and Wells, W. M. (2004). Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921.
- Yushkevich et al. (2006) Yushkevich, P. A., Piven, J., Hazlett, H. C., Smith, R. G., Ho, S., Gee, J. C., and Gerig, G. (2006). User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage, 31(3):1116–1128.
- Zhong et al. (2010) Zhong, J., Phua, D. Y. L., and Qiu, A. (2010). Quantitative evaluation of LDDMM, FreeSurfer, and CARET for cortical surface mapping. NeuroImage, 52(1):131–141.
7 Additional MCMC Details for Implementation
7.1 Posterior and Full Conditional Densities
The following full conditional distributions follow readily,
where , , , and
The remaining steps are to sample and the spatial random effects and . This is discussed in Subsection 2.3 of the manuscript.
8 Posterior sampling under the CMP
Under the CMP elicitation of Bedrick et al. (1996) with the logistic link, we add covariate-binomial reponse pairs, , where and , , are psuedo true statuses and linearly independent covariate vectors for the subpopulations, respectively. Unlike binomial observations, the values , are not necessarily integer-valued. The vector of transformed observations then has the following individual components
where . Similarly, define the diagonal matrix of weights to have diagonal components