We consider the problem of segmenting a biomedical image into anatomical regions of interest. We specifically address the frequent scenario where we have no paired training data that contains images and their manual segmentations. Instead, we employ unpaired segmentation images to build an anatomical prior. Critically, these segmentations can be derived from imaging data from a different dataset and imaging modality than the current task. We introduce a generative probabilistic model that employs the learned prior through a convolutional neural network to compute segmentations in an unsupervised setting. We conducted an empirical analysis of the proposed approach in the context of structural brain MRI segmentation, using a multi-study dataset of more than 14,000 scans. Our results show that an anatomical prior can enable fast unsupervised segmentation, which is typically not possible using standard convolutional networks. The integration of anatomical priors can facilitate CNN-based anatomical segmentation in a range of novel clinical problems, where few or no annotations are available and thus standard networks are not trainable. The code is freely available at http://github.com/adalca/neuron.
Biomedical image segmentation plays a crucial role in many applications, such as population analysis, disease progression modelling, or treatment planning. Convolutional neural networks (CNNs), a class of deep learning methods, have been employed to derive powerful biomedical segmentation algorithms, showing promise of overcoming limitations in previous methods [3, 4, 29, 34]. However, CNN-based approaches most often depend on (large-scale) training data, particularly in the form of image scans paired with segmentations. These annotations are often costly and challenging to obtain because they require the tedious effort of a trained expert, taking several expert hours per scan.
To our knowledge, there has not been a theoretically rigorous effort to integrate rich probabilistic anatomical priors with a CNN-based segmentation model in a computationally effective manner. We introduce a generative model for biomedical segmentation that employs a deep anatomical prior. We describe a principled derivation that follows directly from our generative model. We demonstrate that this yields intuitive cost functions and simpler models. We use an auto-encoding variational CNN to characterize the anatomical prior, and an encoder-decoder CNN to provide fast segmentation of medical images in unsupervised settings.
We demonstrate the method in an unsupervised biomedical image segmentation setting where paired annotations are not available. Our proposed strategy is general and computationally efficient, provides a natural framework for sampling possible subject-specific segmentations of a scan, and provides uncertainty estimates for these segmentations.
CNN-based segmentation approaches generally rely on fully convolutional architectures applied to image data. They extract hierarchical and multi-resolution features that are in turn combined to compute a semantic segmentation [23, 29, 31, 34].
A popular discriminative segmentation architecture, U-Net [29], involves a convolutional encoder or downsampling network, followed by a convolutional decoder or upsampling network, and skip connections between layers. The encoder captures relevant features of the input image at different resolutions. The decoder then synthesizes a high-resolution segmentation, using the skip connections to achieve voxel-level precision. While the exact architecture of these networks, such as the number of layers and levels, size of convolution kernels, or application of batch normalization, varies, they typically involve millions of parameters and necessitate large datasets and data augmentation techniques to train.
CNN-based segmentation models have two major shortcomings: the dependency on annotated data, limiting their use in unsupervised settings; and their lack of anatomical knowledge. The latter limits the network’s ability to be faithful to known anatomical shapes during segmentation.
In our work, we use CNN architectures to learn anatomical priors and segment medical images. The prior eliminates the burden of providing paired example segmentations.
A clinical expert performing manual delineation relies on spatial coordinates and prior knowledge about anatomy, and may use a template of the structures to constrain the task. This process draws on the anatomical similarity across patient scans. This is in stark contrast with typical computer vision problems that have led to many popular CNN architectures, where object location, shape, and appearance can be unpredictable.
Convolutional methods are often limited in incorporating domain expertise. For example, U-Net [29] and its derivatives produce segmentation algorithms that do not exploit location information or other explicit anatomical priors. A CNN might have difficulty differentiating two distinct objects that are consistently in two specific parts of the scan, if they have the same intensity and context (as in bilateral structures in two hemispheres), assuming that the field of view of the network is constrained to not include the other object’s vicinity. While increasingly complex networks that extend receptive fields may tease out object differences in supervised settings, the problem would be trivial if we considered anatomical knowledge such as spatial location. Furthermore, image contrast can be weak or noisy in certain regions, resulting in uncertainty in the segmentations. An anatomical prior can resolve these ambiguities, while making the segmentation task easier.
A popular strategy to explicitly employ prior structure in CNNs for biomedical image segmentation is to use a conditional random field (CRF) as a post-processing step [13, 30, 34]. However, CRFs only capture local constraints, and add to the computational burden. Location information has been included as a feature in patch-based CNN segmentation networks [34]. While this addition carries prior location information, it is network-specific, increases the parameter burden on the network, and does not capture shape information.
Recent methods have employed shape priors for neural network solutions in supervised problems [26, 28]. In particular, they often design a series of networks that learn representations of images and segmentations in a supervised setting. They propose ad-hoc cost functions that encourage the computed segmentations to be similar to both the learned shape and the ground truth. These methods attempt to correct segmentations produced by standard CNNs by adding a prior constraint.
Convolutional image generative models, such as generative adversarial nets, have grown in popularity. They have recently been applied to biomedical image segmentations [16, 24] in a supervised setting where standard loss functions are combined with adversarial losses. A series of recent papers in the computer vision community removes the requirement for paired data by introducing a cycle dependency [37]. However, these methods are less applicable in medical image segmentation with many anatomical labels, as an image signal can pass through the rich networks at low cost, leading to a perfect cycle loss, circumventing the required constraints [37].
Variational Bayes auto-encoders have been used for various tasks to learn probabilistic generative models, and often use convolutional networks [18]. Our method builds on these models to combine anatomical priors with image generation.
Encoding and exploiting prior knowledge is common in generative models. Our inspiration comes from classical atlas-based probabilistic segmentation methods that estimate the maximum a posteriori (MAP) probability based on a generative model involving a prior probability and likelihood [5, 10, 17, 27, 32, 33, 35].
The prior term captures knowledge of underlying anatomy and usually involves a probabilistic atlas and a spatial deformation that models geometric variation. The spatial deformation can be explicitly solved using a registration algorithm or accounted for in a unified segmentation framework [1].
The likelihood models the physical process that yields medical image intensities, sometimes called the appearance model, conditioned on the latent anatomy. These appearance models are often simpler, relying on additive and/or multiplicative Gaussian or Rician noise models [35]. Model parameters are most often estimated using training data, such as annotated image pairs, for example using maximum likelihood.
Given a new image, most popular segmentation algorithms use numerical non-convex optimization and can take several hours per image on a modern CPU.
In our model, we draw on ideas from classical model-based biomedical segmentation algorithms, convolutional neural networks (CNNs) used in semantic segmentation, and recent developments in variational Bayes approximations using neural networks. In our experiments, we consider the segmentation of structural brain MRI scans into cortical and subcortical regions of interest (ROIs). Our results show that the proposed anatomical prior enables rapid unsupervised segmentation. While complex, specialized tools exist for segmenting some specific scan modalities or particular diseases, they do not generalize to other modalities and can take hours to process one scan. Our goal is to provide a first general approach to biomedical image segmentation in an unsupervised setting.
We let $x$ be an (MR) 3D volume, and assume it is generated from a 3D anatomical segmentation map $s$. We will use $x_j$ and $s_j$ to denote the image intensity and label at voxel $j$, respectively.
We use a generative model to describe the spatial distribution, shape, and appearance of anatomical structures. Figure 1 provides a graphical representation.
The prior captures our knowledge about spatial distributions and shape of anatomy. We let $z$ be a latent variable representing an embedding of these shapes, and model the prior probability of this embedding as normal with mean $0$ and an identity covariance matrix:
$$p(z) = \mathcal{N}(z; 0, I), \tag{1}$$
where $\mathcal{N}(\cdot; \mu, \Sigma)$ is the normal distribution parametrized by mean $\mu$ and covariance $\Sigma$.
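We let the segmentation map $s$ be drawn from a categorical prior distribution determined by the low-dimensional embedding $z$ via $f_{j,l}(z|\theta_{s|z})$:
$$p_\theta(s|z) = \prod_j f_{j,s_j}(z|\theta_{s|z}), \tag{2}$$
where $f_{j,l}(\cdot|\theta_{s|z})$ is the probability of label $l$ at voxel $j$.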
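Finally, given the label map $s$, the intensity observations $x$ are generated via $g(s|\theta_{x|s})$, sampled at each voxel from a normal distribution:
$$p_\theta(x|s) = \prod_j \mathcal{N}\big(x_j;\, g_j(s|\theta_{x|s}),\, \sigma^2\big), \tag{3}$$
where $g_j(\cdot)$ is evaluated on the one-hot representation of $s$ with entries $\delta(s_j = l)$, and $\delta(\cdot)$ is the indicator function that evaluates to 1 if $s_j = l$ and 0 otherwise. The joint likelihood is therefore $p_\theta(x, s, z) = p_\theta(x|s)\, p_\theta(s|z)\, p(z)$, where $\theta = \{\theta_{s|z}, \theta_{x|s}\}$. Intuitively, the embedding $z$ determines the possible anatomical shapes in $s$, which in turn determine the possible observed images $x$.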
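The generative process described above (embedding, then label map, then image) can be sketched as ancestral sampling. This is a minimal illustration in NumPy, not the authors' implementation: the function names, the toy decoder `toy_f`, the toy appearance model `toy_g`, and all sizes are hypothetical stand-ins for the networks the paper learns.

```python
import numpy as np

def sample_generative_model(f, g, latent_dim, sigma, rng):
    """Ancestral sampling: embedding z -> label map s -> image x."""
    # Prior: draw the shape embedding from a standard normal.
    z = rng.standard_normal(latent_dim)
    # f(z) gives per-voxel label probabilities, shape (n_voxels, n_labels).
    probs = f(z)
    # Draw one label per voxel by inverting the per-voxel CDF.
    cdf = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    s = np.minimum((u > cdf).sum(axis=1), probs.shape[1] - 1)
    # Likelihood: mean intensity from g(s) plus Gaussian noise with std sigma.
    x = g(s) + sigma * rng.standard_normal(s.shape)
    return z, s, x

# Hypothetical stand-ins (assumptions) for the decoder and appearance model.
_W = np.random.default_rng(0).standard_normal((16 * 3, 4))
_LABEL_MEANS = np.array([0.0, 0.5, 1.0])

def toy_f(z):
    # Softmax over 3 labels at each of 16 voxels.
    logits = (_W @ z).reshape(16, 3)
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def toy_g(s):
    # One mean intensity per label (a simplifying assumption).
    return _LABEL_MEANS[s]

z, s, x = sample_generative_model(toy_f, toy_g, 4, 0.05, np.random.default_rng(1))
```

Sampling in this order mirrors the dependency structure of the model: the image depends on the embedding only through the label map.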
We describe the learning procedure in the next section. Given learned parameters, to obtain the segmentation $\hat{s}$ given a new image $x$, we perform MAP estimation:
$$\hat{s} = \arg\max_s\, p_\theta(s|x). \tag{4}$$
In this section, we describe a learning strategy that uses convolutional neural networks to estimate anatomical representations and optimize posterior segmentation distributions. This procedure is applicable to broad modelling choices for the probability distributions described above. We also discuss a separate learning procedure for the anatomical prior, uncertainty estimation, and implementation.
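Without assuming voxel independence of the segmentation map given an image, estimating the posterior probability $p_\theta(s|x) = \int_z p_\theta(s, z|x)\, dz$ is intractable since it involves integrating over the latent variable $z$. Estimating $p_\theta(z|x)$ is similarly intractable, making the Expectation Maximization algorithm not pertinent.
We first introduce an encoding probability $q_\phi(z|x)$ as an approximation to the intractable $p_\theta(z|x)$, similar to [18]. Consider the KL divergence between the approximate distribution $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$:
$$\begin{aligned} \mathrm{KL}\big[q_\phi(z|x)\,\|\,p_\theta(z|x)\big] &= E_{q}\big[\log q_\phi(z|x) - \log p_\theta(z|x)\big] \\ &= E_{q}\big[\log q_\phi(z|x) - \log p_\theta(x, z)\big] + \log p_\theta(x). \end{aligned} \tag{5}$$
Rearranging terms, we obtain
$$\log p_\theta(x) = \mathrm{KL}\big[q_\phi(z|x)\,\|\,p_\theta(z|x)\big] + E_{q}\big[\log p_\theta(x, z) - \log q_\phi(z|x)\big]. \tag{6}$$
Since the KL divergence of the approximate and true posterior of $z$ is non-negative, the second term is referred to as the variational lower bound of the model evidence or joint probability. For a given approximate distribution $q_\phi(z|x)$, we can estimate $\theta$ by optimizing the lower bound:
$$\mathcal{L}(\theta, \phi; x) = -\mathrm{KL}\big[q_\phi(z|x)\,\|\,p(z)\big] + E_{q}\big[\log p_\theta(x|z)\big]. \tag{7}$$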
We model the approximating posterior as a normal that depends on the image only:
$$q_\phi(z|x) = \mathcal{N}\big(z;\, \mu_{z|x},\, \Sigma_{z|x}\big), \tag{8}$$
where $\Sigma_{z|x}$ is diagonal.
We estimate the parameters of the approximating distribution using convolutional neural networks. We design an encoding convolutional neural network that takes the image $x$ as input and outputs the parameters of the approximating posterior distribution, $\mu_{z|x}$ and $\Sigma_{z|x}$. This network learns how to embed an entire (MR) image into the most likely low-dimensional anatomical embedding and its variance.
Conditioned on the embedding $z$, the probability of the segmentation $s$ can be computed with a decoder network (dec) that takes $z$ as input and outputs the parameters of the segmentation categorical distribution $f(z|\theta_{s|z})$. The parameters of this decoder can be learned using a separate set of segmentations, as described below.
The final part of the generative model, the appearance or likelihood model, can also be learned with a neural network that takes a segmentation probability map as input and computes the parameters $g(\cdot|\theta_{x|s})$. We separately estimate the noise standard deviation $\sigma$, assuming additive zero-mean Gaussian noise in an image, using a difference of Laplacian filters [15].
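In this work, we learn a prior independently from an unpaired segmentation dataset. This enables the flexibility of having an external description of the anatomy that need not be available in the current data. Unfortunately, as before, estimating the probability distribution $p_\theta(s)$ is intractable. Following a derivation similar to the previous section and to the auto-encoding variational Bayes framework, we introduce an approximation to the posterior $p_\theta(z|s)$ as a normal distribution:
$$q_\psi(z|s) = \mathcal{N}\big(z;\, \mu_{z|s},\, \Sigma_{z|s}\big), \tag{9}$$
where $\Sigma_{z|s}$ is diagonal, leading to the following lower bound:
$$\mathcal{L}(\theta_{s|z}, \psi; s) = -\mathrm{KL}\big[q_\psi(z|s)\,\|\,p(z)\big] + E_{q}\big[\log p_\theta(s|z)\big]. \tag{10}$$
This can be optimized using a Stochastic Gradient Variational Bayes (SGVB) estimator that uses mini-batches. The reparametrization trick enables us to sample $z = \mu_{z|s} + \Sigma_{z|s}^{1/2}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, leading to an approximation of the expectation [18]. The loss (the negated bound) for each data point $s^i$ and sample $z^{ik}$ is
$$\mathcal{L}(\theta_{s|z}, \psi; s^i, z^{ik}) = \mathrm{KL}\big[q_\psi(z|s^i)\,\|\,p(z)\big] - \log p_\theta(s^i|z^{ik}). \tag{11}$$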
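The per-sample prior loss described above, a KL term plus a categorical cross-entropy evaluated at a reparametrized sample, can be sketched in NumPy as follows. This is an illustrative sketch only: the function names, the log-variance parametrization of the diagonal covariance, and the `decoder` callable (returning per-voxel label probabilities) are assumptions, not the toolbox's API.

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    """Sample z = mu + std * eps with eps ~ N(0, I) (reparametrization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def prior_vae_loss(s, mu, log_var, decoder, rng):
    """KL term plus categorical cross-entropy for one label map and one sample.

    s:       integer labels, shape (n_voxels,)
    decoder: hypothetical callable z -> probabilities, shape (n_voxels, n_labels)
    """
    z = reparametrize(mu, log_var, rng)
    probs = decoder(z)
    voxels = np.arange(s.shape[0])
    # Negative log-probability of the observed label at each voxel.
    cross_entropy = -np.sum(np.log(probs[voxels, s] + 1e-12))
    return kl_to_standard_normal(mu, log_var) + cross_entropy
```

In practice the encoder would produce `mu` and `log_var` from the segmentation, and the loss would be averaged over a mini-batch and minimized with a gradient-based optimizer.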
We design an encoding network that takes a segmentation map $s$ as input and outputs the parameters $\mu_{z|s}$ and $\Sigma_{z|s}$. Importantly, we learn the parameters of the encoding network given only a set of segmentations $\{s^i\}$, which can be derived from other imaging modalities and/or datasets. The segmentation prior therefore does not require paired training data in the traditional sense. For example, we can use a prior computed using publicly available annotated datasets such as [19] in a problem that involves a different imaging modality than in the current task.
We assume we have learned a segmentation prior using the Auto-Encoding Anatomical Prior described in the previous section. In particular, we will utilize the decoder component of the prior model, namely $p_\theta(s|z)$.
If we had annotated pairs $\{x, s\}$, we could jointly learn model parameters $\theta$ and variational parameters $\phi$ by optimizing the evidence lower bound objective (7), similar to the previous section. For each sample $x^i$ and sample $z^{ik}$, the loss function would be
$$\mathcal{L}(\theta, \phi; x^i, s^i, z^{ik}) = \mathrm{KL}\big[q_\phi(z|x^i)\,\|\,p(z)\big] - \log p_\theta(s^i|z^{ik}) - \log p_\theta(x^i|s^i), \tag{12}$$
resulting in terms of KL divergence, segmentation map categorical cross-entropy, and intensity-based mean squared error, respectively. During training, these terms would ensure that the probability $q_\phi(z|x)$ stays close to the standard normal, while explaining the segmentations, and that the model parameters capture the relationship between the segmentations and the images.
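However, in this paper we tackle the unsupervised setting, where annotated pairs are not available, and we only have the images $\{x^i\}$. Therefore, we cannot compute the categorical cross-entropy term in (12). Instead, we marginalize over the segmentation in the second term of the variational lower bound (7):
$$\log p_\theta(x|z) = \log \sum_s p_\theta(x|s)\, p_\theta(s|z) \;\geq\; \sum_s p_\theta(s|z)\, \log p_\theta(x|s), \tag{13}$$
where we used Jensen’s inequality. We therefore arrive at the following upper bound of the loss function:
$$\mathcal{L}(\theta, \phi; x) = \mathrm{KL}\big[q_\phi(z|x)\,\|\,p(z)\big] - \sum_s p_\theta(s|z)\, \log p_\theta(x|s), \tag{14}$$
where the sum over segmentations is computed using the factorization of $p_\theta(s|z)$ over voxels from (2), and we sample $z \sim q_\phi(z|x)$.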
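A minimal NumPy sketch of an unsupervised loss of this form is given below. It rests on a strong simplifying assumption that is not taken from the paper: each label has a single mean intensity (`label_means`), so the expected negative Gaussian log-likelihood under the per-voxel label probabilities reduces to a probability-weighted squared error. The constant normalization term of the Gaussian is dropped, and all names are hypothetical.

```python
import numpy as np

def unsupervised_loss(x, mu, log_var, decoder, label_means, sigma, rng):
    """KL term plus expected reconstruction error, for one image and one sample.

    x:           image intensities, shape (n_voxels,)
    mu, log_var: encoder outputs for x (diagonal Gaussian over z)
    decoder:     hypothetical callable z -> probabilities (n_voxels, n_labels)
    label_means: assumed mean intensity per label, shape (n_labels,)
    """
    # Sample z ~ q(z | x) with the reparametrization trick.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    probs = decoder(z)
    # KL between the encoder distribution and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Squared error between each voxel and each label's mean intensity,
    # weighted by the probability of that label at that voxel.
    sq_err = (x[:, None] - label_means[None, :]) ** 2
    expected_nll = np.sum(probs * sq_err) / (2.0 * sigma ** 2)
    return kl + expected_nll
```

Because the label probabilities factor over voxels, the expectation over all possible segmentations collapses to a sum over voxels and labels, which keeps the loss cheap to evaluate.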
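Given a new image $x$, we approximate the MAP segmentation $\hat{s}$ by first obtaining the embedding $\hat{z} = \mu_{z|x}$ using the encoder, and taking the maximum segmentation at each voxel, $\hat{s}_j = \arg\max_l f_{j,l}(\hat{z}|\theta_{s|z})$. The operations are fast, since both are feed-forward neural networks.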
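This model also enables sampling segmentations conditioned on a particular image and enables estimation of uncertainty. Given an input image $x$, we can create samples $z^k \sim q_\phi(z|x)$ and $s^k \sim p_\theta(s|z^k)$, simulating different plausible segmentations for a given subject. We can estimate the uncertainty of our segmentation given a new image $x$ using
$$\begin{aligned} p_\theta(s_j = l\,|\,x) &\approx \int_z f_{j,l}(z|\theta_{s|z})\, q_\phi(z|x)\, dz \\ &\approx \frac{1}{K} \sum_{k=1}^{K} f_{j,l}(z^k|\theta_{s|z}), \qquad z^k \sim q_\phi(z|x). \end{aligned} \tag{15}$$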
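The inference and sampling steps above can be sketched as follows. This is an illustration under stated assumptions: `decoder` is a hypothetical callable returning per-voxel label probabilities, the encoder outputs are passed in as `mu` and `log_var`, and per-voxel entropy is used as one common uncertainty summary, which is not necessarily the measure used in the paper.

```python
import numpy as np

def map_segmentation(mu, decoder):
    """Decode the encoder mean and take the most probable label per voxel."""
    return np.argmax(decoder(mu), axis=1)

def mc_label_posterior(mu, log_var, decoder, n_samples, rng):
    """Monte Carlo average of decoded label probabilities over z ~ q(z | x)."""
    total = 0.0
    for _ in range(n_samples):
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        total = total + decoder(z)
    return total / n_samples

def voxel_entropy(probs):
    """Per-voxel entropy of label probabilities (higher means less certain)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)
```

A typical use would be to compute `map_segmentation` for the point estimate and `voxel_entropy(mc_label_posterior(...))` to highlight voxels where plausible segmentations disagree.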
A CNN can be seen as a hierarchical function, a set of concatenated functions, or layers. For example, CNNs often map some input image $x$ to an output probability $p$:
$$p = (h_K \circ h_{K-1} \circ \cdots \circ h_1)(x), \tag{16}$$
where $\circ$ denotes concatenation of layers, and $h_k$ is often some nonlinear function such as a rectified linear unit (ReLU) or max-pool [12] applied to (linear) convolutions of the output of the previous layer (with $h_0 = x$).
Although we operate on 3D images, we use a 2D architecture in our experiments. We experimented with 3D architectures as well, but found little gain while facing significant challenges related to trade-offs between GPU memory, batch size, the number of features, and the number of labels possible. Each encoder consists of five downsampling levels of one convolution layer each, with 3x3 convolution kernels with ELU activations, and 32 features for each kernel. The final layer is dense, with a fixed-length encoding of the mean and standard deviation representations.
The decoder is a mirror of this design, but upsamples instead of downsampling and ends with a sigmoid activation. In addition, we use a final layer that implements a spatially varying voxel-wise (location) prior, which is multiplied with the output of dec (in practice, we add the logarithms). As is common in the atlas-based segmentation literature, the prior was computed as the frequency of labels in the held-out prior dataset, in an affine-normalized coordinate system. This layer discourages any extreme decodings of the embedding but does not capture shape properties, which are encoded in dec.
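The location prior described above can be sketched in NumPy as a per-voxel label frequency table that is combined with the decoder output in log space. The function names are hypothetical, and the renormalization after adding the logarithms is an assumption made here so that the result is again a probability map.

```python
import numpy as np

def label_frequency_prior(label_maps, n_labels):
    """Per-voxel label frequencies from aligned label maps.

    label_maps: integer labels, shape (n_subjects, n_voxels)
    returns:    frequencies, shape (n_voxels, n_labels)
    """
    one_hot = label_maps[..., None] == np.arange(n_labels)
    return one_hot.mean(axis=0)

def apply_location_prior(decoder_probs, location_prior, eps=1e-12):
    """Multiply decoder probabilities by the prior by adding logarithms."""
    log_post = np.log(decoder_probs + eps) + np.log(location_prior + eps)
    # Subtract the per-voxel maximum for numerical stability, then renormalize.
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

Working in log space avoids underflow when either factor is very small, which is likely at voxels where a label never occurs in the prior dataset.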
We implement the normal probability with a single linear layer. We also find it useful to pre-train the image encoder using an image variational auto-encoder similar to the segmentation one. The encoder weights are used as initialization only. During training, we used the Adadelta optimizer [36].
For the latent encoding layers representing the parameters of the embedding distribution, we introduce an activation function that discourages the sample activations from being too large, helping limit numerical issues stemming from sampling from these layers during the reparametrization trick. We use concepts from the softsign and tanh activations to define this function.
We demonstrate our model on two datasets. For the first dataset, we obtain ground truth segmentations using a specialized algorithm with intense computational requirements, combined with manual work and QC [9]. We use a subset of these segmentations to learn the prior probability parameters. We treat the rest of the dataset as unsupervised, where we only use the ground truth segmentations as validation. For the second dataset we do not have ground truth, offering a realistic scenario. Figure 3 shows example images from the two datasets, highlighting the difference, and the difficulty of the task.
We gathered a large-scale multi-site, multi-study dataset of more than 14,000 T1-weighted brain MRI scans from eight publicly available datasets: ADNI [25], OASIS [20], ABIDE [7], ADHD200 [22], MCIC [11], PPMI [21], HABS [6], and Harvard GSP [14]. Subject age ranges, health states, and acquisition details vary with each dataset, but all scans were resampled to a 256x256x256 grid with 1mm isotropic voxels, and all images were cropped to 160x192x224 to eliminate entirely-background voxels.
We carry out standard pre-processing steps, including affine spatial normalization using FreeSurfer for each scan [9]. All MRIs were also segmented with FreeSurfer - a task that takes several CPU hours per scan. We also applied quality control (QC) using visual inspection to catch gross errors in segmentation results.
We partitioned the data into a prior training subset of 5,000 images, where we only used the annotations. The rest of the data was treated as an unannotated dataset, where QCed segmentations were only used for validation.
While developing the network architectures we partitioned the rest of the data into training, validation and test sets. Once the architecture was fixed, we reported results on the test dataset by training and evaluating the model in an unsupervised fashion.
We also gathered a dataset of more than 3800 T2-FLAIR scans, a significantly different MR modality, from the ADNI cohort. These scans exhibit significantly different tissue properties compared to the T1w images, lower acquisition quality, and exhibit mm slice spacing (Figure 3). They provide a good test of our hypothesis that priors learned are useful for segmenting image data with different tissue properties. To our knowledge there is no automatic method to obtain detailed anatomical segmentations for these images. We affinely align these images to the same space as the T1 images using mutual information based affine registration with ANTs [2]. We perform brain extraction using an in-house developed neural network-based algorithm that uses a UNet architecture and extensive data augmentation.
In the set of annotations that we used to train the prior, we avoided including any annotations coming from ADNI subjects whose T2-FLAIR scans are in this dataset.
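We evaluate our results both visually and quantitatively. For the T1w dataset, we use a volume overlap measure, Dice, to quantify the automatic segmentation results [8]:
$$\mathrm{Dice}_l(\hat{s}, s) = \frac{2 \sum_j \delta(\hat{s}_j = l)\, \delta(s_j = l)}{\sum_j \delta(\hat{s}_j = l) + \sum_j \delta(s_j = l)}, \tag{17}$$
where $\hat{s}$ is the predicted segmentation map, and $s_j$ indicates the ground truth (FreeSurfer) label at each location $j$. A Dice score of 1 indicates perfect segmentation.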
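The Dice overlap for each label can be computed with a few lines of NumPy. This is a generic sketch of the standard Dice measure; the function name and the choice to return NaN for labels absent from both maps are assumptions made here for illustration.

```python
import numpy as np

def dice_per_label(pred, truth, labels):
    """Dice overlap between two integer label maps, one score per label."""
    scores = {}
    for label in labels:
        p = pred == label
        t = truth == label
        denom = p.sum() + t.sum()
        # Labels absent from both maps have no defined overlap.
        scores[label] = 2.0 * np.logical_and(p, t).sum() / denom if denom > 0 else float("nan")
    return scores

# Small worked example on flattened label maps.
pred = np.array([0, 1, 1, 2])
truth = np.array([0, 1, 2, 2])
scores = dice_per_label(pred, truth, labels=[0, 1, 2])
```

Here label 0 overlaps perfectly, while labels 1 and 2 each share one voxel out of three assigned across the two maps.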
We experimented with segmenting in the unsupervised setting with standard UNet architectures, using the image MSE and mutual information loss functions. Because many structures share similar intensities, these architectures are not able to produce sensible segmentations that resemble the correct segmentations, and we omit them from these results. Classical unsupervised methods that include sophisticated prior anatomical information take a significant amount of time to run, and we regard FreeSurfer results as an optimistic bound for the T1w data. However, as these methods tend to be focused on specific modalities, there is no annotation tool for cortical and subcortical regions in T2-FLAIR. We evaluate the T2-FLAIR segmentation visually in Figure 5.
At test time, a new subject only needs to be affinely registered to a template, after which the proposed CNN model evaluates a segmentation estimate. The entire process takes less than a few seconds on an NVidia Titan X GPU.
Fig. 4 shows a series of example segmentations for the T1w dataset, demonstrating that our method is able to estimate approximate anatomical structures, reproducing the general location as well as the shape of structures. Fine details, such as details of cortex folding, are not easily captured by the prior encoding, leading to smooth segmentation predictions. Fig. 6 illustrates the average Dice measure across several anatomical regions for T1 scans. We focus on the most prevalent (larger) structures, which can also be evaluated in detail in the visualizations of Figure 4.
Fig. 5 shows results of our algorithm on T2-FLAIR scans. Even with the significantly lower image quality and different tissue contrasts, our algorithm is able to use the prior information to predict plausible segmentations, even given challenging images in this unsupervised scenario.
Our method is able to detect anatomical structures that are guided by image contrast while respecting anatomical shapes according to the prior. Rapid, zero-shot segmentation is a challenging task, and has not been tackled by previous methods. As such, the absence of prior results makes it difficult to fully interpret current results. The detailed FreeSurfer results are an upper bound, which any model is unlikely to achieve in the unsupervised setting. We omit showing results from lower bound (simplistic) baselines, such as the unsupervised U-Net model described above, since these models yielded nonsensical segmentations. To the best of our knowledge, our results are the first for zero-shot neural-network based segmentation of brain structures.
In this paper, we introduced a generative probabilistic model that employs a prior model learned through a convolutional neural network to compute segmentations in an unsupervised setting. We can interpret the anatomical prior as encouraging the neural network to predict segmentation maps that come from a known distribution characterized by the learned prior, while simultaneously producing images that agree with the observed scan. We demonstrate that our model enables segmentation using convolutional networks, leading to rapid inference in a setting where segmentation is traditionally not possible, or takes hours to obtain for a single scan. The integration of priors promises to facilitate accurate anatomical segmentation in a variety of novel clinical problems with limited dataset availability.