Medical image segmentation is inherently an ambiguous task due to factors such as partial volumes and variations in anatomical definitions. While in most cases the segmentation uncertainty is around the border of structures of interest, there can also be considerable inter-rater differences. The class of conditional variational autoencoders (cVAE) offers a principled approach to inferring distributions over plausible segmentations that are conditioned on input images. Segmentation uncertainty estimated from samples of such distributions can be more informative than using pixel level probability scores. In this work, we propose a novel conditional generative model that is based on conditional Normalizing Flow (cFlow). The basic idea is to increase the expressivity of the cVAE by introducing a cFlow transformation step after the encoder. This yields improved approximations of the latent posterior distribution, allowing the model to capture richer segmentation variations. With this we show that the quality and diversity of samples obtained from our conditional generative model is enhanced. Performance of our model, which we call cFlow Net, is evaluated on two medical imaging datasets demonstrating substantial improvements in both qualitative and quantitative measures when compared to a recent cVAE based model.READ FULL TEXT VIEW PDF
Segmentation of anatomical structures and pathologies is inherently
Deep learning motivated by convolutional neural networks has been highly...
Medical imaging only indirectly measures the molecular identity of the t...
Since the advent of U-Net, fully convolutional deep neural networks and ...
Quantifying segmentation uncertainty has become an important issue in me...
Fully convolutional neural networks (FCNs), and in particular U-Nets, ha...
We propose OPAL-Net, a novel hierarchical architecture for part-based la...
Medical image segmentation is inherently an ambiguous task and segmentation methods capable of quantifying uncertainty by inferring distributions over segmentations are therefore of substantial interest to the medical imaging community [wilson1988image, kendall2017uncertainties, jensen2019improving]. Estimating uncertainty from distributions over segmentations is closer to the clinical settings than pixel-wise uncertainty estimates, where whenever feasible multiple expert opinions are used to ascertain downstream clinical decisions. Such consensus based decisions not only account for the aleatoric (inherent) and epistemic (modeling) uncertainties but also explain the inter-rater variability that is largely inevitable in medical image segmentation.
Bayesian deep learning methods offer a natural setting to infer distributions over segmentations. This has been explored to some extent for medical image segmentation in the spirit of Monte Carlo estimation where multiple hypotheses are explored by predicting segmentation masks with different dropout rates [gal2016dropout] or with an ensemble of models [rupprecht2017learning]. These methods can output a fixed number of samples with pixel level probability scores which can be a limitation.
Conditional variational autoencoders (cVAE) [sohn2015learning] belong to the class of conditional generative models. cVAEs can be used to obtain an unlimited number of predictions by sampling from a latent space conditioned on the input images. This model was recently adapted for medical image segmentation as the probabilistic U-Net (Prob. U-Net) [kohl2018probabilistic] demonstrating the possibility of generating large number of plausible segmentations. The Prob. U-Net model fuses an additional channel obtained from the latent space to the final layer (at the highest resolution) of U-Net to obtain a variety of albeit less diverse and blurry segmentations when compared to the raters. Quite recently, two models have sought to improve upon the Prob. U-Net [baumgartner2019phiseg, kohl2019hierarchical]. Both these methods hypothesize that the blurriness and lack of diversity observed in samples obtained from Prob. U-Net is caused by the use of a single latent variable at the highest resolution. They propose to multiple several latent variables in a hierarchical fashion operating at different resolutions to make the model more expressive and demonstrate this to be helpful.
In this work, we focus on obtaining expressive latent representations that can yield diverse segmentations within the cVAE setting. While we agree with [baumgartner2019phiseg, kohl2019hierarchical] that Prob. U-Net suffers from using the latent representation at a single resolution, we argue that it can be alleviated by using a more expressive latent posterior distribution instead of using multiple latent variables in a hierarchical setting. This arises from the fact that all cVAE type models, including Prob. U-Net, use an axis aligned Gaussian as the latent distribution which can be limiting when approximating a complex latent posterior distribution [rezende2015variational, kingma2016improved]. We propose to improve the approximation of the latent posterior distribution with conditional normalizing flows (cFlow) which can yield arbitrarily complex distributions starting from simple ones. We demonstrate that these complex distributions operating at a single resolution are able to capture richer diversity of realistic segmentations. We propose a novel conditional normalizing flow model – cFlow Net – and demonstrate the use of two types of normalizing flow transformations: Planar flows [rezende2015variational] and Generative Flows [kingma2018glow]. We evaluate the method on two medical imaging datasets: LIDC-IDRI [armato2004lung] for detecting lesions in lungs from chest CT and for detecting retina blood vessels from on a new Retinal Vessel dataset created from three older datasets [staal2004ridge, hoover2000locating, owen2011retinal]. We compare the performance of our model with Prob. U-Net and demonstrate that cFlow Net shows significant improvements on both quantitative (generalized energy distance and dice) and qualitative measures.
Image segmentation tasks can be formulated in a conditional generative model setting with the objective of estimating the conditional distribution , where and are the input images and corresponding binary segmentations, respectively of dimensions with channels. This has been approached using the conditional VAE formulation where the conditional distribution is approximated by introducing dependency on a -dimensional latent variable [sohn2015learning, kohl2018probabilistic], as shown in Figure 1 (center).
The cVAE objective minimizes the KL divergence between the true latent posterior distribution and its variational approximation resulting in an objective of the form [sohn2015learning]:
The first term is the expected conditional log-likelihood (CLL) under the variational distribution and the second term can be seen as the regularization forcing the posterior distribution to match the conditional prior distribution . In cVAE, the posterior density is modeled as a diagonal Gaussian density for tractability reasons: . The mean
and varianceare predicted using an encoder network parameterized by . The decoder and prior networks are parameterized by and respectively.
Normalizing flows can be used to transform simple base distributions into complex ones using a sequence of bijective transformations (the flow chain) with easy to compute Jacobians [rezende2015variational, papamakarios2019normalizing]. They basically extend the change of variable rule to transform a base distribution into a target distribution in successive steps. Normalizing flows can transform a simple base distribution into an arbitrarily complex target distribution, , by composing complex flow transformations with simpler flow steps [papamakarios2019normalizing].
Consider one such bijective transformation composed of steps:
Forward evaluation of this flow chain, transforming , can be written as:
where is distributed according to the base distribution .
Reverse evaluation of the flow chain, transforming , can be written as:
The transformed distribution, , is obtained from the base distribution, , adjusted by the inverse absolute Jacobian determinant of the flow transformation. For a single flow step :
where denotes the Jacobian determinant. The complete transformation using the full flow chain in log domain is given by
where the last equality follows from .
When using cVAE-like models for medical image segmentation tasks, it is assumed that the diversity of segmentations is captured with the latent posterior distribution. However, using a simple distribution, such as an axis-aligned Gaussian, to approximate the latent posterior distribution can be too restrictive and might not be sufficiently expressive to capture richer variations. This is noticeable in the Prob. U-Net model [kohl2018probabilistic] where the segmentations are blurry and lack diversity [baumgartner2019phiseg, kohl2019hierarchical]. It is in this context that normalizing flows can be used to improve the flexibility of the approximate posterior density to capture a richer diversity of high quality segmentations.
If we denote the approximate posterior density output by the encoder network as the base distribution, , using the latent variable , then using the idea of normalizing flows in Section 2 can yield more expressive posterior densities. If the base distribution is transformed using a flow chain of steps according to Eq. (2), then the transformed distribution after steps with can be written using Eq. (6) as:
It can be shown that the modified objective for the conditional flow-based model becomes (see Section 6.1 in Supplementary material for details):
Note that the expectation is with respect to the base distribution of the normalizing flow . The KL divergence is similar to the term for cVAE in Eq. 1 except for an additional term due to the log determinant of the Jacobian terms in Eq. (7).
In this work we use planar flows introduced in [rezende2015variational] modified to be conditioned on theinput image with each step of the flow:
are learnable parameters predicted by a conditioning neural networksimilar to the conditioning network used in [lu2019structured], is the dimensionality of the latent space and is an element-wise non-linearity such as tanh with derivative . The Jacobian determinant for the planar flow step is given by
The conditioning on the flow chain is introduced through the context vector which is dependent on . The context vector is also predicted by the encoder network. The proposed cFlow Net model is visualized in Figure 2.
Note that at inference, to sample multiple segmentations only the Sampling Process part of the model is used. Given an image , the prior network can be used to obtain multiple latent variable samples which are then decoded by the decoder network to output multiple segmentations for the input image.
All experiments are performed on two publicly available datasets. Both datasets comprise labels from at least two raters used to quantify the performance of all models. We use a training-validation-test split of 60:20:20 for both datasets.
LIDC-IDRI dataset: The LIDC-IDRI dataset consists of thoracic CT scans with four raters annotating the lesions in them [armato2004lung]. We use patches of size centered on lesions similar to the procedures followed in [kohl2018probabilistic, baumgartner2019phiseg] to obtain patches in total. The preprocessed data is obtained from [lidc].
Retinal Vessel dataset: As a secondary dataset we create a new dataset derived from three older retinal vessel segmentation datasets: DRIVE [staal2004ridge], STARE [hoover2000locating] and CHASE [owen2011retinal]. Each of these datasets has a subset of images with labels from two raters. We collected images with two raters from these three datasets, extracted retinal masks when there weren’t any and resized them such that all images are of height px. This yields images of which are of size px and the remaining are px. All images have vessel annotations from two raters.(Figure 4 in Supplementary material).
The proposed cFlow Net model is compared with the probabilistic U-Net [kohl2018probabilistic], and additionally with the deterministic U-Net [ronneberger2015u] for the single rater setting. Other than the cFlow Net model described in Section 3 with planar flows [rezende2015variational], we additionally report the cFlow model with conditional generative flow model which uses the Glow transformation steps [kingma2018glow, lu2019structured] (Section 6.2 in Supplementary material).
Performance of the models in the multiple annotator setting is evaluated based on the generalized energy distance () which captures the diversity of samples obtained from the generative models when compared to the annotators. It is given by
where are samples from the ground truth distribution, , comprising different raters, are samples from the generative distribution, , learned by the model and is 1-IoU ( intersection-over-union) measure. Additionally, we report the negative conditional log likelihood (-CLL approximated with samples (Section 6.3 in Supplementary material) and the dice accuracy for the single rater settings.
|Models||LIDC Dataset||Retina Dataset|
|All Raters||Single Rater||All Raters||Single Rater|
|cFlow Net (Planar)||47.3||0.204||89.0||0.288||0.713||4.436||0.884||4.482||0.877||0.632|
|cFlow Net (Glow)||49.2||0.302||217.0||0.547||0.704||4.482||0.901||4.488||0.878||0.620|
Both variants of the cFlow Net models use flow steps. The decoder network in the cFlow Net and Prob. U-Net was a deterministic U-Net with 4 resolutions identical to the ones used in [kohl2018probabilistic]. Architectures of both encoder and prior networks were similar to the encoding path of the decoder network. In addition to predicting the mean and variance , the encoder network in the cFlow Net model outputs a context vector of dimension which is input to the flow transformation block as illustrated in Figure 2. The conditioning network
is a three layered multi-layer perceptron (MLP) with 8 hidden units. Latent space dimension ofwas used for the Prob. U-Net and the cFlow Net models. All the models were trained using a batch size of and a learning rate of with the Adam optimizer [kingma2014adam]. The models were trained for a maximum of epochs and training convergence was assumed when there was no improvement in validation loss for epochs. Models with the best validation loss was used to evaluate the performance on test set reported in Table 1
. The experiments were run using PyTorch[paszke2019pytorch]222Open source implementation of our model will be made available here: https://github.com/raghavian/cFlow on a single Tesla K80 GPU with 12GB memory. The computation time for both variants of the cFlow Net models on LIDC dataset was s, and about s on the Retinal Vessel dataset per training epoch.
Performance of all the models on test set of both the datasets are reported in Table 1. Within each dataset we report the performance when compared to All Raters and a Single Rater
. Statistically significant improvement in performance (based on paired sample t-tests with) when compared to other models are highlighted in bold.
The proposed cFlow Net (Planar) model is consistently better than the baseline Prob. U-Net model on the LIDC dataset in and -CLL measures. The performance of the cFlow Net (Planar) model in the Single Rater setting shows a large improvement when compared to Prob. U-Net model. This is also demonstrated in Figure 3 seen as more realistic and diverse samples generated only by training only on a single (the first) rater. There is a small reduction in performance of all the conditional generative models when compared to the Det. U-Net model in dice accuracy.
The significant improvements in for the cFlow Net models reported in Table 1 are also reflected qualitatively in the samples shown in Figure 3. Samples from cFlow Net (row 2) are not only able to capture the variations amongst all four raters (row 1) but the remainder samples appear plausible. When trained with a single rater (row 3), the cFlow Net model is still able to capture a richer diversity of segmentations. As annotations are available from only a single rater in majority of applications, this behaviour of the cFlow Net of being able to capture diverse segmentations from single rater is desirable. This is in contrast with the samples from Prob. U-Net even when trained with all raters (row 4), where the samples appear blurry and are unable to reflect the diversity of the four raters. This lack of diversity becomes more pronounced when trained with a single rater, as the Prob. U-Net model outputs almost identical looking samples (row 5).
In the last column of Figure 3 we also show the mean prediction obtained from samples of each model (brighter regions have higher probability). The mean predictions from the cFlow Net model trained on a single rater could be more informative than the mean prediction from Prob. U-Net trained on a single rater. This further strengthens our argument that improving the approximation to the latent posterior distribution with conditional normalizing flows helps capture meaningful uncertainty with the possibility of sampling unlimited number of diverse segmentations.
A similar trend is also observed with the Retinal Vessel dataset. This is a far more challenging dataset as the images are acquired differently and the quality of annotations vary between the six annotators. This is captured as higher and -CLL across all models. Even within this setting, the cFlow Net models fare better than the Prob. U-Net model in both the single and multiple rater experiments. There was no significant difference in dice accuracy between any of the methods indicating the stochastic generative components of the proposed models do not affect segmentation accuracy.
We proposed a novel conditional generative model based on conditional normalizing flows to quantify uncertainty in segmentations. The use of cFlow steps improved the approximation of the latent posterior distribution, captured in the smaller negative conditional log likelihood values and also manifested in the diversity of samples. The primary contribution in this work is the incorporation of conditional normalizing flows for handling high dimensional data such as medical images. Theflow transformation block is modular and can be easily replaced with any suitable normalizing flow providing access to a rich class of improved conditional generative models [papamakarios2019normalizing]. We demonstrated this feature of cFlow Net with two types of normalizing flow transformations: Planar [rezende2015variational] and Glow [kingma2018glow] with promising performance.
We start with the motivation of approximating the latent posterior distribution with a variational distribution and minimizing the reverse KL divergence,
The last equality is due to Bayes’ Rule. A little rearranging yields,
Note the first term is and the first two terms form a lower bound on the conditional likelihood. Thus,
The bound is equal to the conditional likelihood when the KL divergence between the variational distribution and the true posterior is zero. This is the reason we can optimize a surrogate objective such as the bound on the conditional likelihood to indirectly optimize the KL divergence.
The negative of the bound on the conditional log likelihood is the standard objective in a cVAE given in Eq. (1). With
steps of normalizing flows, we can factorise the random variable at stepwith first order Markov assumption as
The transformed densities at each step are related by the flow transformation in Eq. (2)
Following this, the full factorisation with can be written as:
We also demonstrated the performance of cFlow Net model which uses generative flow (Glow) model [kingma2018glow] with conditioning to transform the base distribution. We followed a strategy similar to [lu2019structured] to obtain conditional Glow steps wherein we use a neural network which takes the context vector as input to predict the parameters of the Glow steps.
Each Glow step comprises three sub-steps and transforms the random variable conditioned on the context vector into :
Sub-step 1. ActNorm:
Sub-step 2. 1x1 Convolution:
Sub-step 3. Affine Coupling:
with , and are MLPs, are intermediate transformed variables and is the transformed variable after one complete Glow step.
The the conditional log likelihood reported in Table 1 is obtained by marginalizing over the latent variable approximated using a Monte Carlo estimate with samples for each image in the test set: