Uncertainty quantification in medical image segmentation with Normalizing Flows

06/04/2020 ∙ by Raghavendra Selvan, et al. ∙ Københavns Uni 15

Medical image segmentation is inherently an ambiguous task due to factors such as partial volumes and variations in anatomical definitions. While in most cases the segmentation uncertainty is around the border of structures of interest, there can also be considerable inter-rater differences. The class of conditional variational autoencoders (cVAE) offers a principled approach to inferring distributions over plausible segmentations that are conditioned on input images. Segmentation uncertainty estimated from samples of such distributions can be more informative than using pixel level probability scores. In this work, we propose a novel conditional generative model that is based on conditional Normalizing Flow (cFlow). The basic idea is to increase the expressivity of the cVAE by introducing a cFlow transformation step after the encoder. This yields improved approximations of the latent posterior distribution, allowing the model to capture richer segmentation variations. With this we show that the quality and diversity of samples obtained from our conditional generative model is enhanced. Performance of our model, which we call cFlow Net, is evaluated on two medical imaging datasets demonstrating substantial improvements in both qualitative and quantitative measures when compared to a recent cVAE based model.



There are no comments yet.


page 8

page 12

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Medical image segmentation is inherently an ambiguous task and segmentation methods capable of quantifying uncertainty by inferring distributions over segmentations are therefore of substantial interest to the medical imaging community [wilson1988image, kendall2017uncertainties, jensen2019improving]. Estimating uncertainty from distributions over segmentations is closer to the clinical settings than pixel-wise uncertainty estimates, where whenever feasible multiple expert opinions are used to ascertain downstream clinical decisions. Such consensus based decisions not only account for the aleatoric (inherent) and epistemic (modeling) uncertainties but also explain the inter-rater variability that is largely inevitable in medical image segmentation.

Remarkable strides in supervised medical image segmentation have been made with deep learning methods 

[ronneberger2015u, cciccek20163d, zhou2019review]. These methods, however, provide point estimates of segmentations – meaning a single segmentation mask per image – which limits our ability to quantify the uncertainty of said segmentations.

Bayesian deep learning methods offer a natural setting to infer distributions over segmentations. This has been explored to some extent for medical image segmentation in the spirit of Monte Carlo estimation where multiple hypotheses are explored by predicting segmentation masks with different dropout rates [gal2016dropout] or with an ensemble of models [rupprecht2017learning]. These methods can output a fixed number of samples with pixel level probability scores which can be a limitation.

Conditional variational autoencoders (cVAE) [sohn2015learning] belong to the class of conditional generative models. cVAEs can be used to obtain an unlimited number of predictions by sampling from a latent space conditioned on the input images. This model was recently adapted for medical image segmentation as the probabilistic U-Net (Prob. U-Net) [kohl2018probabilistic] demonstrating the possibility of generating large number of plausible segmentations. The Prob. U-Net model fuses an additional channel obtained from the latent space to the final layer (at the highest resolution) of U-Net to obtain a variety of albeit less diverse and blurry segmentations when compared to the raters. Quite recently, two models have sought to improve upon the Prob. U-Net [baumgartner2019phiseg, kohl2019hierarchical]. Both these methods hypothesize that the blurriness and lack of diversity observed in samples obtained from Prob. U-Net is caused by the use of a single latent variable at the highest resolution. They propose to multiple several latent variables in a hierarchical fashion operating at different resolutions to make the model more expressive and demonstrate this to be helpful.

Figure 1: Graphical model view of VAE and variations to it including the proposed cFlow Net (right)

In this work, we focus on obtaining expressive latent representations that can yield diverse segmentations within the cVAE setting. While we agree with [baumgartner2019phiseg, kohl2019hierarchical] that Prob. U-Net suffers from using the latent representation at a single resolution, we argue that it can be alleviated by using a more expressive latent posterior distribution instead of using multiple latent variables in a hierarchical setting. This arises from the fact that all cVAE type models, including Prob. U-Net, use an axis aligned Gaussian as the latent distribution which can be limiting when approximating a complex latent posterior distribution [rezende2015variational, kingma2016improved]. We propose to improve the approximation of the latent posterior distribution with conditional normalizing flows (cFlow) which can yield arbitrarily complex distributions starting from simple ones. We demonstrate that these complex distributions operating at a single resolution are able to capture richer diversity of realistic segmentations. We propose a novel conditional normalizing flow model – cFlow Net – and demonstrate the use of two types of normalizing flow transformations: Planar flows [rezende2015variational] and Generative Flows [kingma2018glow]. We evaluate the method on two medical imaging datasets: LIDC-IDRI [armato2004lung] for detecting lesions in lungs from chest CT and for detecting retina blood vessels from on a new Retinal Vessel dataset created from three older datasets [staal2004ridge, hoover2000locating, owen2011retinal]. We compare the performance of our model with Prob. U-Net and demonstrate that cFlow Net shows significant improvements on both quantitative (generalized energy distance and dice) and qualitative measures.

2 Background & Problem Formulation

Image segmentation tasks can be formulated in a conditional generative model setting with the objective of estimating the conditional distribution , where and are the input images and corresponding binary segmentations, respectively of dimensions with channels. This has been approached using the conditional VAE formulation where the conditional distribution is approximated by introducing dependency on a -dimensional latent variable  [sohn2015learning, kohl2018probabilistic], as shown in Figure 1 (center).

The cVAE objective minimizes the KL divergence between the true latent posterior distribution and its variational approximation resulting in an objective of the form [sohn2015learning]:


The first term is the expected conditional log-likelihood (CLL) under the variational distribution and the second term can be seen as the regularization forcing the posterior distribution to match the conditional prior distribution . In cVAE, the posterior density is modeled as a diagonal Gaussian density for tractability reasons: . The mean

and variance

are predicted using an encoder network parameterized by . The decoder and prior networks are parameterized by and respectively.

Normalizing flows can be used to transform simple base distributions into complex ones using a sequence of bijective transformations (the flow chain) with easy to compute Jacobians [rezende2015variational, papamakarios2019normalizing]. They basically extend the change of variable rule to transform a base distribution into a target distribution in successive steps. Normalizing flows can transform a simple base distribution into an arbitrarily complex target distribution, , by composing complex flow transformations with simpler flow steps [papamakarios2019normalizing].

Consider one such bijective transformation composed of steps:


Forward evaluation of this flow chain, transforming , can be written as:


where is distributed according to the base distribution .
Reverse evaluation of the flow chain, transforming , can be written as:


The transformed distribution, , is obtained from the base distribution, , adjusted by the inverse absolute Jacobian determinant of the flow transformation. For a single flow step :


where denotes the Jacobian determinant. The complete transformation using the full flow chain in log domain is given by


where the last equality follows from .

3 Methods

When using cVAE-like models for medical image segmentation tasks, it is assumed that the diversity of segmentations is captured with the latent posterior distribution. However, using a simple distribution, such as an axis-aligned Gaussian, to approximate the latent posterior distribution can be too restrictive and might not be sufficiently expressive to capture richer variations. This is noticeable in the Prob. U-Net model [kohl2018probabilistic] where the segmentations are blurry and lack diversity [baumgartner2019phiseg, kohl2019hierarchical]. It is in this context that normalizing flows can be used to improve the flexibility of the approximate posterior density to capture a richer diversity of high quality segmentations.

If we denote the approximate posterior density output by the encoder network as the base distribution, , using the latent variable , then using the idea of normalizing flows in Section 2 can yield more expressive posterior densities. If the base distribution is transformed using a flow chain of steps according to Eq. (2), then the transformed distribution after steps with can be written using Eq. (6) as:


It can be shown that the modified objective for the conditional flow-based model becomes (see Section 6.1 in Supplementary material for details):


Note that the expectation is with respect to the base distribution of the normalizing flow . The KL divergence is similar to the term for cVAE in Eq. 1 except for an additional term due to the log determinant of the Jacobian terms in Eq. (7).

Figure 2: Proposed cFlow Net model. (Training) The training process takes the reference segmentations and the image data as input to the encoder, which predicts the mean

and standard deviation

of the base distribution along with the context vector

for the flow transformation. The flow transformation block transforms the base distribution, to an approximation of the target posterior distribution in steps. The latent space is jointly learnt by minimizing the KL divergence between the transformed posterior distribution and the conditional prior . (Sampling) The sampling process involves obtaining samples from the conditional prior which is used with the input image together to be decoded in the decoder to obtain the segmentation . After training the model, only the sampling part of the network is used for inference.

3.0.1 Planar Flows:

In this work we use planar flows introduced in [rezende2015variational] modified to be conditioned on theinput image with each step of the flow:



are learnable parameters predicted by a conditioning neural network

similar to the conditioning network used in [lu2019structured], is the dimensionality of the latent space and is an element-wise non-linearity such as tanh with derivative . The Jacobian determinant for the planar flow step is given by


The conditioning on the flow chain is introduced through the context vector which is dependent on . The context vector is also predicted by the encoder network. The proposed cFlow Net model is visualized in Figure 2.

Note that at inference, to sample multiple segmentations only the Sampling Process part of the model is used. Given an image , the prior network can be used to obtain multiple latent variable samples which are then decoded by the decoder network to output multiple segmentations for the input image.

4 Experiments & Results

4.1 Data

All experiments are performed on two publicly available datasets. Both datasets comprise labels from at least two raters used to quantify the performance of all models. We use a training-validation-test split of 60:20:20 for both datasets.
LIDC-IDRI dataset: The LIDC-IDRI dataset consists of thoracic CT scans with four raters annotating the lesions in them [armato2004lung]. We use patches of size centered on lesions similar to the procedures followed in [kohl2018probabilistic, baumgartner2019phiseg] to obtain patches in total. The preprocessed data is obtained from [lidc].
Retinal Vessel dataset: As a secondary dataset we create a new dataset derived from three older retinal vessel segmentation datasets: DRIVE [staal2004ridge], STARE [hoover2000locating] and CHASE [owen2011retinal]. Each of these datasets has a subset of images with labels from two raters. We collected images with two raters from these three datasets, extracted retinal masks when there weren’t any and resized them such that all images are of height px. This yields images of which are of size px and the remaining are px. All images have vessel annotations from two raters.(Figure 4 in Supplementary material).

4.2 Experiments and Results

The proposed cFlow Net model is compared with the probabilistic U-Net [kohl2018probabilistic], and additionally with the deterministic U-Net [ronneberger2015u] for the single rater setting. Other than the cFlow Net model described in Section 3 with planar flows [rezende2015variational], we additionally report the cFlow model with conditional generative flow model which uses the Glow transformation steps [kingma2018glow, lu2019structured] (Section 6.2 in Supplementary material).

Performance of the models in the multiple annotator setting is evaluated based on the generalized energy distance () which captures the diversity of samples obtained from the generative models when compared to the annotators. It is given by


where are samples from the ground truth distribution, , comprising different raters, are samples from the generative distribution, , learned by the model and is 1-IoU ( intersection-over-union) measure. Additionally, we report the negative conditional log likelihood (-CLL approximated with samples (Section 6.3 in Supplementary material) and the dice accuracy for the single rater settings.

Models LIDC Dataset Retina Dataset
All Raters Single Rater All Raters Single Rater
-CLL -CLL Dice -CLL -CLL Dice
Det.U-Net [ronneberger2015u] 0.727 0.624
Prob.U-Net [kohl2018probabilistic] 52.1 0.279 238.9 0.579 0.698 4.738 0.905 4.495 0.946 0.616
cFlow Net (Planar) 47.3 0.204 89.0 0.288 0.713 4.436 0.884 4.482 0.877 0.632
cFlow Net (Glow) 49.2 0.302 217.0 0.547 0.704 4.482 0.901 4.488 0.878 0.620
Table 1: Performance comparison of all models. Higher is better for Dice and lower is better for -CLL and . Significant differences are shown in bold.

Both variants of the cFlow Net models use flow steps. The decoder network in the cFlow Net and Prob. U-Net was a deterministic U-Net with 4 resolutions identical to the ones used in [kohl2018probabilistic]. Architectures of both encoder and prior networks were similar to the encoding path of the decoder network. In addition to predicting the mean and variance , the encoder network in the cFlow Net model outputs a context vector of dimension which is input to the flow transformation block as illustrated in Figure 2. The conditioning network

is a three layered multi-layer perceptron (MLP) with 8 hidden units. Latent space dimension of

was used for the Prob. U-Net and the cFlow Net models. All the models were trained using a batch size of and a learning rate of with the Adam optimizer [kingma2014adam]. The models were trained for a maximum of epochs and training convergence was assumed when there was no improvement in validation loss for epochs. Models with the best validation loss was used to evaluate the performance on test set reported in Table 1

. The experiments were run using PyTorch 

[paszke2019pytorch]222Open source implementation of our model will be made available here: https://github.com/raghavian/cFlow on a single Tesla K80 GPU with 12GB memory. The computation time for both variants of the cFlow Net models on LIDC dataset was s, and about s on the Retinal Vessel dataset per training epoch.

4.3 Results & Discussion

Performance of all the models on test set of both the datasets are reported in Table 1. Within each dataset we report the performance when compared to All Raters and a Single Rater

. Statistically significant improvement in performance (based on paired sample t-tests with

) when compared to other models are highlighted in bold.

Figure 3: Qualitative results showing the segmentation diversity of the cFlow Net model and Prob. U-Net for one scan from LIDC-IDRI test set. First row shows the input image, segmentation masks from the four raters; Rows 2 and 3 are samples from cFlow Net model when trained with all and a single (first) rater; Rows 4 and 5 show samples from the Prob. U-Net model for all and single rater setting. Mean prediction over all samples are shown in the last column (brigher regions correspond to higher probability).

The proposed cFlow Net (Planar) model is consistently better than the baseline Prob. U-Net model on the LIDC dataset in and -CLL measures. The performance of the cFlow Net (Planar) model in the Single Rater setting shows a large improvement when compared to Prob. U-Net model. This is also demonstrated in Figure 3 seen as more realistic and diverse samples generated only by training only on a single (the first) rater. There is a small reduction in performance of all the conditional generative models when compared to the Det. U-Net model in dice accuracy.

The significant improvements in for the cFlow Net models reported in Table 1 are also reflected qualitatively in the samples shown in Figure 3. Samples from cFlow Net (row 2) are not only able to capture the variations amongst all four raters (row 1) but the remainder samples appear plausible. When trained with a single rater (row 3), the cFlow Net model is still able to capture a richer diversity of segmentations. As annotations are available from only a single rater in majority of applications, this behaviour of the cFlow Net of being able to capture diverse segmentations from single rater is desirable. This is in contrast with the samples from Prob. U-Net even when trained with all raters (row 4), where the samples appear blurry and are unable to reflect the diversity of the four raters. This lack of diversity becomes more pronounced when trained with a single rater, as the Prob. U-Net model outputs almost identical looking samples (row 5).

In the last column of Figure 3 we also show the mean prediction obtained from samples of each model (brighter regions have higher probability). The mean predictions from the cFlow Net model trained on a single rater could be more informative than the mean prediction from Prob. U-Net trained on a single rater. This further strengthens our argument that improving the approximation to the latent posterior distribution with conditional normalizing flows helps capture meaningful uncertainty with the possibility of sampling unlimited number of diverse segmentations.

A similar trend is also observed with the Retinal Vessel dataset. This is a far more challenging dataset as the images are acquired differently and the quality of annotations vary between the six annotators. This is captured as higher and -CLL across all models. Even within this setting, the cFlow Net models fare better than the Prob. U-Net model in both the single and multiple rater experiments. There was no significant difference in dice accuracy between any of the methods indicating the stochastic generative components of the proposed models do not affect segmentation accuracy.

5 Conclusion

We proposed a novel conditional generative model based on conditional normalizing flows to quantify uncertainty in segmentations. The use of cFlow steps improved the approximation of the latent posterior distribution, captured in the smaller negative conditional log likelihood values and also manifested in the diversity of samples. The primary contribution in this work is the incorporation of conditional normalizing flows for handling high dimensional data such as medical images. The

flow transformation block is modular and can be easily replaced with any suitable normalizing flow providing access to a rich class of improved conditional generative models [papamakarios2019normalizing]. We demonstrated this feature of cFlow Net with two types of normalizing flow transformations: Planar [rezende2015variational] and Glow [kingma2018glow] with promising performance.


6 Supplementary Material

6.1 Derivation of cFlow Net objective

We start with the motivation of approximating the latent posterior distribution with a variational distribution and minimizing the reverse KL divergence,

The last equality is due to Bayes’ Rule. A little rearranging yields,

Note the first term is and the first two terms form a lower bound on the conditional likelihood. Thus,




The bound is equal to the conditional likelihood when the KL divergence between the variational distribution and the true posterior is zero. This is the reason we can optimize a surrogate objective such as the bound on the conditional likelihood to indirectly optimize the KL divergence.

The negative of the bound on the conditional log likelihood is the standard objective in a cVAE given in Eq. (1). With

steps of normalizing flows, we can factorise the random variable at step

with first order Markov assumption as


The transformed densities at each step are related by the flow transformation in Eq. (2)


Following this, the full factorisation with can be written as:


where we set the transformed variable at step to be the posterior variable of interest i.e., . Using this flow transformed distribution from Eq. (17) in Eq. (1) yields the final cFlow Net objective.


6.2 cFlow Net (Glow)

We also demonstrated the performance of cFlow Net model which uses generative flow (Glow) model [kingma2018glow] with conditioning to transform the base distribution. We followed a strategy similar to [lu2019structured] to obtain conditional Glow steps wherein we use a neural network which takes the context vector as input to predict the parameters of the Glow steps.

Each Glow step comprises three sub-steps and transforms the random variable conditioned on the context vector into :

Sub-step 1. ActNorm:

Log Det. (21)

Sub-step 2. 1x1 Convolution:

Log Det. (24)

Sub-step 3. Affine Coupling:

Log Det. (30)

with , and are MLPs, are intermediate transformed variables and is the transformed variable after one complete Glow step.

6.3 Estimation of conditional log likelihood

The the conditional log likelihood reported in Table 1 is obtained by marginalizing over the latent variable approximated using a Monte Carlo estimate with samples for each image in the test set:

Figure 4: Examples from the Retinal Vessel Segmentation dataset