1 Introduction
The semantic segmentation task assigns a class label to each pixel in an image. While in many cases the context in the image provides sufficient information to resolve the ambiguities in this mapping, there exists an important class of images where even the full image context is not sufficient to resolve all ambiguities. Such ambiguities are common in medical imaging applications, e.g., in lung abnormalities segmentation from CT images. A lesion might be clearly visible, but the information about whether it is cancer tissue or not might not be available from this image alone. Similar ambiguities are also present in photos. For example, a patch of fur visible under the sofa might belong to a cat or a dog, but it is not possible to resolve this ambiguity from the image alone (footnote 1). Most existing segmentation algorithms either provide only one likely consistent hypothesis (e.g., “all pixels belong to a cat”) or a pixelwise probability (e.g., “each pixel is 50% cat and 50% dog”).
Footnote 1: In lee2016stochastic this is defined as ambiguous evidence, in contrast to implicit class confusion, which stems from an ambiguous class definition (e.g. the concepts of desk vs. table). For the presented work this differentiation is not required.
Especially in medical applications where a subsequent diagnosis or a treatment depends on the segmentation map, an algorithm that only provides the most likely hypothesis might lead to misdiagnoses and suboptimal treatment. Providing only pixelwise probabilities ignores all covariances between the pixels, which makes a subsequent analysis much more difficult if not impossible. If multiple consistent hypotheses are provided, these can be directly propagated into the next step in a diagnosis pipeline, they can be used to suggest further diagnostic tests to resolve the ambiguities, or an expert with access to additional information can select the appropriate one(s) for the subsequent steps.
Here we present a segmentation framework that provides multiple segmentation hypotheses for ambiguous images (Fig. 1a). Our framework combines a conditional variational autoencoder (CVAE) vae1 ; vae2 ; vae3 ; vae4 , which can model complex distributions, with a UNet Ronneberger2015 , which delivers state-of-the-art segmentations in many medical application domains. A low-dimensional latent space encodes the possible segmentation variants. A random sample from this space is injected into the UNet to produce the corresponding segmentation map. One key feature of this architecture is the ability to model the joint probability of all pixels in the segmentation map. This results in multiple segmentation maps, where each of them provides a consistent interpretation of the whole image. Furthermore, our framework is also able to learn hypotheses that have a low probability and to predict them with the corresponding frequency. We demonstrate these features on a lung abnormalities segmentation task, where each lesion has been segmented independently by four experts, and on the Cityscapes dataset, where we artificially flip labels with a certain frequency during training.
A body of work with different approaches towards probabilistic and multimodal segmentation exists. The most common approaches provide independent pixelwise probabilities kendall2015bayesian ; kendall2017uncertainties . These models induce a probability distribution by using dropout over spatial features. While this strategy meets that line of work’s objective of quantifying the pixelwise uncertainty, it produces inconsistent outputs. A simple way to produce plausible hypotheses is to learn an ensemble of (deep) models lakshminarayanan2017simple . While the outputs produced by ensembles are consistent, they are not necessarily diverse, and ensembles are typically not able to learn the rare variants as their members are trained independently. In order to overcome this, several approaches train models jointly using the oracle set loss guzman2012multiple , i.e. a loss that only accounts for the closest prediction to the ground truth. This has been explored in lee2015m and lee2016stochastic using an ensemble of deep networks, and in rupprecht2017learning and Ilg2018uncertainty using one common deep network with M heads. While multi-head approaches may have the capacity to capture a diverse set of variants, they are not equipped to learn the occurrence frequencies of individual variants. Two common disadvantages of both ensembles and M heads models are that they scale poorly to large numbers of hypotheses and that the number of allowed hypotheses must be fixed at training time. Another set of approaches to produce multiple diverse solutions relies on graphical models, such as junction chains chen2013computing , and more generally Markov Random Fields batra2012diverse ; kirillov2015inferring ; kirillov2015m ; kirillov2016joint . While many of the previous approaches are guaranteed to find the best diverse solutions, these are confined to structured problems whose dependencies can be described by tractable graphical models.
The task of image-to-image translation isola2017image tackles a very similar problem: an under-constrained domain transfer of images needs to be learned. Many of the recent approaches employ generative adversarial networks (GANs), which are known to suffer from challenges such as ‘mode-collapse’ goodfellow2016nips . In an attempt to solve the mode-collapse problem, the ‘bicycleGAN’ zhu2017toward involves a component that is similar in architecture to ours. In contrast to our proposed architecture, their model encompasses a fixed prior distribution and during training their posterior distribution is only conditioned on the output image. Very recent work on generating appearances given a shape encoding esser2018variational also combines a UNet with a VAE, and was developed concurrently to ours. In contrast to our proposal, their training requires an additional pretrained VGG-net that is employed as a reconstruction loss. Finally, bouchacourt2016disco proposes a probabilistic model for structured outputs based on optimizing the dissimilarity coefficient rao1982diversity between the ground truth and predicted distributions. The resultant approach is assessed on the task of hand pose estimation, that is, predicting the location of 14 joints, arguably a simpler space compared to the space of segmentations we consider here. Similarly to the approach presented below, they inject latent variables at a later stage of the network architecture.
The main contributions of this work are: (1) Our framework provides consistent segmentation maps instead of pixelwise probabilities and can therefore give a joint likelihood of modes. (2) Our model can induce arbitrarily complex output distributions including the occurrence of very rare modes, and is able to learn calibrated probabilities of segmentation modes. (3) Sampling from our model is computationally cheap. (4) In contrast to many existing applications of deep generative models that can only be qualitatively evaluated, our application and datasets allow quantitative performance evaluation including penalization of missing modes.
2 Network Architecture and Training Procedure
Our proposed network architecture is a combination of a conditional variational auto encoder vae1 ; vae2 ; vae3 ; vae4 with a UNet Ronneberger2015 , with the objective of learning a conditional density model over segmentations, conditioned on the image.
Sampling. The central component of our architecture (Fig. 1a) is a low-dimensional latent space $\mathbb{R}^N$ (a small $N$ performed best in our experiments). Each position in this space encodes a segmentation variant. The ‘prior net’, parametrized by weights $\omega$, estimates the probability of these variants for a given input image $X$. This prior probability distribution (called $P$ in the following) is modelled as an axis-aligned Gaussian with mean $\mu_{\text{prior}}(X;\omega) \in \mathbb{R}^N$ and variance $\sigma_{\text{prior}}(X;\omega) \in \mathbb{R}^N$. To predict a set of $m$ segmentations we apply the network $m$ times to the same input image (only a small part of the network needs to be re-evaluated in each iteration, see below). In each iteration $i \in \{1, \dots, m\}$, we draw a random sample $z_i \in \mathbb{R}^N$ from $P$,

$$z_i \sim P(\cdot \mid X) = \mathcal{N}\big(\mu_{\text{prior}}(X;\omega),\ \operatorname{diag}(\sigma_{\text{prior}}(X;\omega))\big), \tag{1}$$

broadcast the sample to an $N$-channel feature map with the same shape as the segmentation map, and concatenate this feature map to the last activation map of a UNet (the UNet is parameterized by weights $\theta$). A function $f_{\text{comb.}}$ composed of three subsequent convolutions ($\psi$ being the set of their weights) combines the information and maps it to the desired number of classes. The output, $S_i$, is the segmentation map corresponding to point $z_i$ in the latent space:

$$S_i = f_{\text{comb.}}\big(f_{\text{U-Net}}(X;\theta),\ z_i;\ \psi\big). \tag{2}$$

Notice that when drawing $m$ samples for the same input image, we can reuse the output of the prior net and the feature activations of the UNet. Only the function $f_{\text{comb.}}$ needs to be re-evaluated $m$ times.
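The reuse described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the names `pointwise_layer` and `sample_segmentations` are hypothetical, the latent scale is treated as a standard deviation, and the combination layers are assumed to be 1x1 convolutions with ReLU activations between them. The point it shows is that the image features and the prior parameters are computed once, and only the cheap combination step runs once per sample.

```python
import random


def pointwise_layer(pixel, weights, relu):
    """Apply one 1x1 convolution, i.e. a matrix product on a single pixel's features.

    `weights` holds one list of input weights per output channel.
    """
    out = [sum(x * w for x, w in zip(pixel, channel)) for channel in weights]
    return [max(v, 0.0) for v in out] if relu else out


def sample_segmentations(features, mu, sigma, layers, num_samples, rng):
    """Draw `num_samples` label maps from one image's cached activations.

    features: H x W x C nested lists, the last UNet activation map (computed once).
    mu, sigma: length-N mean and standard deviation from the prior net (computed once).
    layers: weights of the combination function, assumed here to be 1x1 convolutions.
    """
    segmentations = []
    for _ in range(num_samples):
        # One latent sample per segmentation, drawn from the axis-aligned Gaussian.
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        label_map = []
        for row in features:
            label_row = []
            for pixel in row:
                # Broadcasting z over the image and concatenating channels means
                # every pixel sees its own features followed by the same z.
                x = list(pixel) + z
                for k, weights in enumerate(layers):
                    x = pointwise_layer(x, weights, relu=k < len(layers) - 1)
                # The final layer yields one score per class; keep the best class.
                label_row.append(max(range(len(x)), key=x.__getitem__))
            label_map.append(label_row)
        segmentations.append(label_map)
    return segmentations
```

Because the latent sample is identical at every pixel, a single draw shifts the whole map coherently, which is what makes each output a consistent hypothesis rather than a per-pixel mixture.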
Training. The networks are trained with the standard training procedure for conditional VAEs (Fig. 1b), i.e. by minimizing the negative of the variational lower bound (Eq. 4). The main difference with respect to training a deterministic segmentation model is that the training process additionally needs to find a useful embedding of the segmentation variants in the latent space. This is solved by introducing a ‘posterior net’, parametrized by weights $\nu$, that learns to recognize a segmentation variant (given the raw image $X$ and the ground truth segmentation $Y$) and to map this to a position $\mu_{\text{post}}(X,Y;\nu)$ with some uncertainty $\sigma_{\text{post}}(X,Y;\nu)$ in the latent space. The output is denoted as posterior distribution $Q$. A sample $z$ from this distribution,

$$z \sim Q(\cdot \mid X, Y) = \mathcal{N}\big(\mu_{\text{post}}(X,Y;\nu),\ \operatorname{diag}(\sigma_{\text{post}}(X,Y;\nu))\big), \tag{3}$$

combined with the activation map of the UNet (Eq. 2) must result in a predicted segmentation $S$ identical to the ground truth segmentation $Y$ provided in the training example. A cross-entropy loss penalizes differences between $S$ and $Y$ (the cross-entropy loss arises from treating the output $S$ as the parameterization of a pixelwise categorical distribution $P_c$). Additionally there is a Kullback-Leibler divergence $D_{\text{KL}}(Q \,\|\, P) = \mathbb{E}_{z \sim Q}[\log Q - \log P]$ which penalizes differences between the posterior distribution $Q$ and the prior distribution $P$. Both losses are combined as a weighted sum with a weighting factor $\beta$, as done in higgins2016beta :

$$\mathcal{L}(Y, X) = \mathbb{E}_{z \sim Q(\cdot \mid Y, X)}\big[-\log P_c(Y \mid S(X, z))\big] + \beta \cdot D_{\text{KL}}\big(Q(z \mid Y, X) \,\|\, P(z \mid X)\big). \tag{4}$$

The training is done from scratch with randomly initialized weights. During training, this KL loss “pulls” the posterior distribution (which encodes a segmentation variant) and the prior distribution towards each other. On average (over multiple training examples) the prior distribution will be modified in a way such that it “covers” the space of all presented segmentation variants for a specific input image (footnote 2).
Footnote 2: An open source re-implementation of our approach can be found at https://github.com/SimonKohl/probabilistic_unet.
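As a rough illustration of the two loss terms described above, the following sketch evaluates a weighted sum of a pixelwise cross-entropy and a KL divergence between two axis-aligned Gaussians. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the Gaussians are parametrized by standard deviations, the expectation over the posterior is approximated by the single sample that produced `class_probs`, and the closed-form KL term is the standard expression for diagonal Gaussians.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for axis-aligned Gaussians given per-dimension means and std devs."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def cross_entropy(class_probs, labels):
    """Sum over pixels of -log of the probability given to the ground-truth class."""
    return -sum(math.log(probs[label]) for probs, label in zip(class_probs, labels))


def training_loss(class_probs, labels, mu_q, sigma_q, mu_p, sigma_p, beta):
    """Weighted sum of the reconstruction term and the KL term.

    class_probs: per-pixel class probabilities predicted from one posterior sample.
    labels: per-pixel ground-truth class indices.
    beta: weighting factor of the KL term.
    """
    reconstruction = cross_entropy(class_probs, labels)
    divergence = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
    return reconstruction + beta * divergence
```

The KL term vanishes when posterior and prior coincide, so a prior that already covers a variant incurs no extra penalty for it.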
3 Performance Measures and Baseline Methods
In this section we first present the metric used to assess the performance of all approaches, and then describe each competitor approach used in the comparisons.
3.1 Performance measures
As is common in the semantic segmentation literature, we employ the intersection over union (IoU) as a measure to compare a pair of segmentations. However, in the present case, we not only want to compare a deterministic prediction with a unique ground truth, but rather we are interested in comparing distributions of segmentations. To do so, we use the generalized energy distance bellemare2017cramer ; salimans2018improving ; szekely2013energy , which leverages distances between observations:

$$D^2_{\text{GED}}(P_{\text{gt}}, P_{\text{out}}) = 2\,\mathbb{E}\big[d(S, Y)\big] - \mathbb{E}\big[d(S, S')\big] - \mathbb{E}\big[d(Y, Y')\big], \tag{5}$$

where $d$ is a distance measure, $Y$ and $Y'$ are independent samples from the ground truth distribution $P_{\text{gt}}$, and similarly, $S$ and $S'$ are independent samples from the predicted distribution $P_{\text{out}}$. The energy distance $D_{\text{GED}}$ is a metric as long as $d$ is also a metric klebanov2005n . In our case we choose $d(x, y) = 1 - \text{IoU}(x, y)$, which, as proved in kosub2016note ; lipkus1999proof , is a metric. In practice, we only have access to samples from the distributions that models induce, so we rely on statistics of Eq. 5, $\hat{D}^2_{\text{GED}}$. The details about its computation for each experiment are presented in Appendix B.
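A sample-based estimate of the squared generalized energy distance with an IoU-based distance could look like the sketch below. It makes several assumptions that are not taken from the text: masks are binary and stored as sets of foreground pixel coordinates, two empty masks are treated as identical (distance 0), and all pairs including self-pairs are averaged. The exact estimator used in the experiments is described in Appendix B and may differ.

```python
from itertools import product


def iou_distance(a, b):
    """1 - IoU between two binary masks given as sets of foreground pixel coordinates."""
    union = a | b
    if not union:
        # Both masks are empty; this sketch treats them as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_distance(xs, ys):
    """Average pairwise distance between two collections of masks."""
    return sum(iou_distance(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def squared_energy_distance(samples, ground_truths):
    """Plug-in estimate: 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')]."""
    return (2.0 * mean_distance(samples, ground_truths)
            - mean_distance(samples, samples)
            - mean_distance(ground_truths, ground_truths))
```

For instance, if the graders provide two disjoint masks and a model always predicts the first one, the cross term is 0.5, the sample diversity term is 0 and the grader diversity term is 0.5, giving a squared distance of 0.5. A model that reproduces both masks equally often scores 0.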
3.2 Baseline methods
With the aim of providing context for the performance of our proposed approach we compare against a range of baselines. To the best of our knowledge there exists no other work that has considered capturing a distribution over multimodal segmentations and has measured the agreement with such a distribution. For fair comparison, we train the baseline models whose architectures are depicted in Fig. 2 in the exact same manner as we train ours. The baseline methods all involve the same UNet architecture, i.e. they share the same core component and thus employ comparable numbers of learnable parameters in the segmentation tasks.
Dropout UNet (Fig. 2a). Our ‘Dropout UNet’ baselines follow the Bayesian SegNet’s kendall2015bayesian proposition: we apply dropout to the activations of the respective incoming layers of the three inner-most encoder and decoder blocks, using the same dropout probability during training as well as when sampling.
UNet Ensemble (Fig. 2b). We report results for ensembles with the number of members matching the required number of samples (referred to as ‘UNet Ensemble’). The original deterministic variant of the UNet is the 1sample corner case of an ensemble.
MHeads (Fig. 2c). Aiming for diverse semantic segmentation outputs, the works of rupprecht2017learning and Ilg2018uncertainty propose to branch off M heads after the last layer of a deep net, each of which contributes one output variant. An adjusted cross-entropy loss that adaptively assigns heads to ground-truth hypotheses is employed to promote diversity while reducing the risk of idle heads: the loss of the best performing head is weighted with a factor of $1 - \epsilon$, while the remaining heads each contribute with a weight of $\epsilon / (M - 1)$ to the loss. For our ‘MHeads’ baselines we again employ a UNet core and set $\epsilon$ as proposed by rupprecht2017learning . In order to allow for the evaluation of 4, 8 and 16 samples, we train MHeads models with the corresponding number of heads.
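The adaptive head weighting can be sketched as follows, assuming the relaxed scheme in which the best head receives most of the weight and the remaining weight is shared equally among the other heads. The function name `m_heads_loss` and the parameter `epsilon` are hypothetical names for this sketch, and the value used in the usage note is purely illustrative.

```python
def m_heads_loss(head_losses, epsilon):
    """Combine per-head losses so that the best head dominates the training signal.

    head_losses: one loss value per head for a single training example.
    epsilon: total weight shared equally by all heads except the best one.
    """
    num_heads = len(head_losses)
    if num_heads == 1:
        # A single head degenerates to the ordinary loss.
        return head_losses[0]
    best = min(range(num_heads), key=head_losses.__getitem__)
    total = 0.0
    for index, loss in enumerate(head_losses):
        weight = 1.0 - epsilon if index == best else epsilon / (num_heads - 1)
        total += weight * loss
    return total
```

With three heads, losses of 1, 3 and 5 and an illustrative `epsilon` of 0.1, the best head contributes 0.9 and the other two contribute 0.15 and 0.25. The small residual weight keeps otherwise idle heads receiving gradient.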
Image2Image VAE (Fig. 2d). In zhu2017toward the authors propose a UNet VAE-GAN hybrid for multimodal image-to-image translation that owes its stochasticity to normally distributed latents that are broadcast and fed into the encoder path of the UNet. In order to deal with the complex solution space in image-to-image translation tasks, they employ an adversarial discriminator as additional supervision alongside a reconstruction loss. In the fully supervised setting of semantic segmentation such an additional learning signal is however not necessary and we therefore train with a cross-entropy loss only. In contrast to our proposition, this baseline, which we refer to as the ‘Image2Image VAE’, employs a prior that is not conditioned on the input image (a fixed normal distribution) and a posterior net that is not conditioned on the input either.
In all cases we examine the models’ performance when drawing a different number of samples (1, 4, 8 and 16) from each of them.
4 Results
A quantitative evaluation of multiple segmentation predictions per image requires annotations from multiple labelers. Here we consider two datasets: the LIDC-IDRI dataset armato2015 ; armato2011lung ; clark2013cancer , which contains 4 annotations per input, and the Cityscapes dataset cordts2016cityscapes , which we artificially modify by adding synonymous classes to introduce uncertainty in the way concepts are labelled.
4.1 Lung abnormalities segmentation
The LIDC-IDRI dataset armato2015 ; armato2011lung ; clark2013cancer contains 1018 lung CT scans from 1010 lung patients with manual lesion segmentations from four experts. This dataset is a good representation of the typical ambiguities that appear in CT scans. For each scan, 4 radiologists (from a total of 12) provided annotation masks for lesions that they independently detected and considered to be abnormal. We use the masks resulting from a second reading in which the radiologists were shown the anonymized annotations of the others and were allowed to make adjustments to their own masks.
For our experiments we split this dataset into a training set composed of 722 patients, a validation set composed of 144 patients, and a test set composed of the remaining 144 patients. We then resampled the CT scans to a common in-plane resolution (the original resolution varies between scans) and cropped 2D images centered at the lesion positions. The lesion positions are those where at least one of the experts segmented a lesion. Because the scans are cropped, the resultant task is, in isolation, not directly clinically relevant. However, this allows us to ignore the vast areas in which all labelers agree, in order to focus on those where there is uncertainty. This resulted in 8882 images in the training set, 1996 images in the validation set and 1992 images in the test set. Because the experts can disagree on whether the lesion is abnormal tissue, up to 3 masks per image can be empty. Fig. 3a shows an example of such lesion-centered images and the masks provided by 4 graders.
As all models share the same UNet core component and for fairness and ease of comparability, we let all models undergo the same training schedule, which is detailed in subsection H.1.
To give some intuition about the kind of samples produced by each model, we show in Fig. 3a, as well as in Appendix F, representative results for the baseline methods and our proposed Probabilistic UNet. Fig. 4a shows the squared generalized energy distance for all models as a function of the number of samples. The data accumulations visible as horizontal stripes are due to the existence of empty ground-truth masks. On the lung abnormalities test set of 1992 images, the energy distance decreases for all models as more samples are drawn, indicating an improved matching of the ground-truth distribution as well as enhanced sample diversity. Our proposed Probabilistic UNet outperforms all baselines when sampling 4, 8 and 16 times (numerical results can be found in Table 2). According to the Wilcoxon signed-rank test, its performance at 16 samples is significantly better than that of the baselines. Finally, in Appendix E we show the results of an experiment regarding the capacity different models have to distinguish between unambiguous and ambiguous instances (i.e. instances where graders disagree on the presence of a lesion).
4.2 Cityscapes semantic segmentation
As a second dataset we use the Cityscapes dataset cordts2016cityscapes . It contains images of street scenes taken from a car with corresponding semantic segmentation maps. A total of 19 different semantic classes are labelled. Based on this dataset we designed a task that allows full control of the ambiguities: we create ambiguities by artificial random flips of five classes to newly introduced classes. We flip ‘sidewalk’ to ‘sidewalk 2’, ‘person’ to ‘person 2’, ‘car’ to ‘car 2’, ‘vegetation’ to ‘vegetation 2’ and ‘road’ to ‘road 2’, each with its own fixed probability. This choice yields distinct probabilities for the ensuing discrete modes, ranging from 10.9% (all unflipped) down to 0.5% (all flipped). The official training dataset with fine-grained annotation labels comprises 2975 images and the validation dataset contains 500 images. We employ this official validation set as a test set to report results on, and split off 274 images (corresponding to the 3 cities of Darmstadt, Mönchengladbach and Ulm) from the official training set as our internal validation set. As in the previous experiment, in this task we use a similar setting for the training processes of all approaches, which we present in detail in subsection H.2.
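To make the combinatorics of this task concrete, the sketch below enumerates all joint flip outcomes for five independently flipped classes and computes the probability of each mode. The flip probabilities in `FLIP_PROBS` are an assumption of this sketch: they were chosen only because they reproduce the 32 distinct modes and the 10.9% to 0.5% range quoted above, and they should not be read as the values used in the experiments.

```python
from fractions import Fraction
from itertools import product

# Assumed flip probabilities, picked to be consistent with the quoted range of
# mode probabilities; the values actually used are not given in this section.
FLIP_PROBS = {
    "sidewalk": Fraction(8, 17),
    "person": Fraction(7, 17),
    "car": Fraction(6, 17),
    "vegetation": Fraction(5, 17),
    "road": Fraction(4, 17),
}


def mode_probabilities(flip_probs):
    """Map every combination of flipped/unflipped classes to its joint probability."""
    classes = list(flip_probs)
    modes = {}
    for flips in product([False, True], repeat=len(classes)):
        probability = Fraction(1)
        for name, flipped in zip(classes, flips):
            # Classes flip independently, so the joint probability is a product.
            probability *= flip_probs[name] if flipped else 1 - flip_probs[name]
        modes[flips] = probability
    return modes


if __name__ == "__main__":
    modes = mode_probabilities(FLIP_PROBS)
    for flips, probability in sorted(modes.items(), key=lambda kv: -kv[1])[:3]:
        print(flips, f"{float(probability):.3%}")
```

Because the classes flip independently, the number of modes doubles with every additional ambiguous class, which is why approaches that need one model or head per mode scale poorly here.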
Fig. 3b shows samples of each approach in the comparison given one input image. In Appendix G we show further samples of other images, produced by our approach. Fig. 4b shows that the Probabilistic UNet on the Cityscapes task outperforms the baseline methods when sampling 4, 8 and 16 times in terms of the energy distance (numerical results can be found in Table 3). This edge in segmentation performance at 16 samples is highly significant according to the Wilcoxon signed-rank test. We have also conducted ablation experiments in order to explore which elements of our architecture contribute to its performance. These were (1) fixing the prior, (2) fixing the prior and not using the context in the posterior, and (3) injecting the latent features at the beginning of the UNet. Each of these variations resulted in a lower performance. Detailed results can be found in Appendix D.
Reproducing the segmentation probabilities.
In the Cityscapes segmentation task, we can provide further analysis by leveraging our knowledge of the underlying conditional distribution that we have set by design. In particular, we compare the frequency with which every model predicts each mode to the corresponding ground truth probability of that mode. To compute the frequency of each mode for each model, we draw 16 samples from that model for all images in the test set. Then, for each mode, we count the number of samples for which that mode is the closest one (using 1 - IoU as the distance function).
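The counting procedure above can be sketched as a nearest-mode assignment. The sketch is illustrative only: `mode_frequencies` is a hypothetical helper, each labelling is represented as a set of (pixel, label) pairs so that a single Jaccard-style IoU covers several classes at once, and ties are broken by the iteration order of the mode dictionary. The evaluation in the experiments may compute the multi-class IoU differently.

```python
from collections import Counter


def iou_distance(a, b):
    """1 - IoU between two labellings given as sets of (pixel, label) pairs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mode_frequencies(samples, modes):
    """Assign each sample to its closest mode and return relative frequencies.

    samples: list of predicted labellings.
    modes: dict mapping a mode name to its reference labelling.
    """
    counts = Counter()
    for sample in samples:
        closest = min(modes, key=lambda name: iou_distance(sample, modes[name]))
        counts[closest] += 1
    return {name: counts[name] / len(samples) for name in modes}


if __name__ == "__main__":
    modes = {
        "unflipped": {(0, "car"), (1, "road")},
        "car flipped": {(0, "car 2"), (1, "road")},
    }
    samples = [
        {(0, "car"), (1, "road")},
        {(0, "car"), (1, "road")},
        {(0, "car 2"), (1, "road")},
        {(0, "car 2"), (1, "sidewalk")},
    ]
    print(mode_frequencies(samples, modes))
```

Comparing the resulting frequencies with the known mode probabilities shows whether a model is calibrated or merely diverse.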
In Fig. 5 (and Figs. 8, 9, 10 in Appendix C) we report the mode-wise frequencies for all 32 modes in the Cityscapes task and show that the Probabilistic UNet is the only model in this comparison that is able to closely capture the frequencies of a large combinatorial space of hypotheses including very rare modes, thus supplying calibrated likelihoods of modes. The Image2Image VAE is the only model among the competitors that picks up on all variants, but its frequencies are far off, as can be seen in its deviation from the bisector line in blue. The other baselines perform worse still in that all of them fail to represent some modes, and the modes they do capture do not match the expected frequencies.
4.3 Analysis of the Latent Space
The embedding of the segmentation variants in a lowdimensional latent space allows a qualitative analysis of the internal representation of our model. For a 2D or 3D latent space we can directly visualize where the segmentation variants get assigned. See Appendix A for details.
5 Discussion and conclusions
Our first set of experiments demonstrates that our proposed architecture provides consistent segmentation maps that closely match the multimodal ground-truth distributions given by the expert graders in the lung abnormalities task and by the combinatorial ground-truth segmentation modes in the Cityscapes task. The employed IoU-based energy distance measures both whether the models’ individual samples are coherent and whether they are produced with the expected frequencies. It not only penalizes predicted segmentation variants that are far away from the ground truth, but also penalizes missing variants. On this task the Probabilistic UNet is able to significantly outperform the considered baselines, indicating its capability to model the joint likelihood of segmentation variants.
The second type of experiments demonstrates that our model scales to complex output distributions including the occurrence of very rare modes. With 32 discrete modes of largely differing occurrence likelihoods (0.5% to 10.9%), the Cityscapes task requires the ability to closely match complex data distributions. Here too our model performs best and picks the segmentation modes very close to the expected frequencies, all the way into the regime of very unlikely modes, thus defying mode-collapse and exhibiting excellent probability calibration. As an additional advantage, our model scales to such large numbers of modes without requiring any prior assumptions on the number of modes or hypotheses.
The lower performance of the baseline models relative to our proposition can be attributed to design choices of these models. While the Dropout UNet successfully models the pixelwise data distribution (Fig. 8a bottom right, in the Appendix), such pixelwise mixtures of variants cannot be valid hypotheses in themselves (see Fig. 3). The UNet Ensemble’s members are trained independently and each of them can only learn the most likely segmentation variant, as attested to by Fig. 8b. In contrast, the closely related MHeads model can pick up on multiple discrete segmentation modes, due to the joint training procedure that enables diversity. The training does however not allow the model to correctly represent frequencies, and it requires knowledge of the number of present variants (see Fig. 9a, in the Appendix). Furthermore, neither the UNet Ensemble nor the MHeads can deal with the combinatorial explosion of segmentation variants when multiple aspects vary independently of each other. The Image2Image VAE shares similarities with our model, but as its prior is fixed and not conditioned on the input image, it cannot learn to capture variant frequencies by allocating corresponding probability mass to the respective latent space regions. Fig. 17 in the Appendix shows a severe miscalibration of variant likelihoods on the lung abnormalities task that is also reflected in its corresponding energy distance. Furthermore, in this architecture, the latent samples are fed into the UNet’s encoder path, while we feed in the samples just after the decoder path. This design choice in the Image2Image VAE requires the model to carry the latent information all the way through the UNet core, while simultaneously performing the recognition required for segmentation, which might additionally complicate training (see analysis in Appendix D).
Besides that, our design choice of late injection has the additional advantage that we can produce a large set of samples for a given image at a very low computational cost: for each new sample from the latent space only the network part after the injection needs to be re-executed to produce the corresponding segmentation map (this bears similarity to the approach taken in bouchacourt2016disco , where a generative model is employed to model hand pose estimation).
Aside from the ability to capture arbitrary modes with their corresponding probability conditioned on the input, our proposed Probabilistic UNet allows its latent space to be inspected. This is because, as opposed to e.g. GAN-based approaches, VAE-like models explicitly parametrize distributions, a characteristic that grants direct access to the corresponding likelihood landscape. Appendix A discusses how the Probabilistic UNet chooses to structure its latent spaces.
Compared to aforementioned concurrent work for imagetoimage tasks esser2018variational , our model disentangles the prior and the segmentation net. This can be of particular relevance in medical imaging, where processing 3D scans is common. In this case it is desirable to condition on the entire scan, while retaining the possibility to process the scan tile by tile in order to be able to process large volumes with large models with a limited amount of GPU memory.
On a more general note, we would like to remark that current image-to-image translation tasks only allow subjective (and expensive) performance evaluations, as it is typically intractable to assess the entire solution space. For this reason, surrogate metrics such as the inception score, based on the evaluation via a separately trained deep net, are employed salimans2016improved . The task of multimodal semantic segmentation, which we consider here, allows for a direct and thus perhaps more meaningful manner of performance evaluation and could help guide the design of future generative architectures.
All in all we see a large field where our proposed Probabilistic UNet can replace the currently applied deterministic UNets. Especially in the medical domain, with its often ambiguous images and highly critical decisions that depend on the correct interpretation of the image, our model’s segmentation hypotheses and their likelihoods could 1) inform diagnosis/classification probabilities or 2) guide steps to resolve ambiguities. Our method could prove useful beyond explicitly multimodal tasks, as the inspectability of the Probabilistic UNet’s latent space could yield insights for many segmentation tasks that are currently treated as a unimodal problem.
6 Acknowledgements
The authors would like to thank Mustafa Suleyman, Trevor Back and the whole DeepMind team for their exceptional support, and Shakir Mohamed and Andrew Zisserman for very helpful comments and discussions. The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study.
References
 [1] Lee, S., Prakash, S.P.S., Cogswell, M., Ranjan, V., Crandall, D., Batra, D.: Stochastic multiple choice learning for training diverse deep ensembles. In: Advances in Neural Information Processing Systems. (2016) 2119–2127
 [2] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR). (2013)

 [3] Jimenez Rezende, D., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning (ICML). (2014)
 [4] Kingma, D.P., Jimenez Rezende, D., Mohamed, S., Welling, M.: Semisupervised learning with deep generative models. In: Neural Information Processing Systems (NIPS). (2014)
 [5] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems. (2015) 3483–3491
 [6] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015. Volume 9351 of LNCS., Springer (2015) 234–241
 [7] Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian segnet: Model uncertainty in deep convolutional encoderdecoder architectures for scene understanding. arXiv preprint arXiv:1511.02680 (2015)

 [8] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems. (2017) 5580–5590
 [9] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems. (2017) 6405–6416
 [10] GuzmanRivera, A., Batra, D., Kohli, P.: Multiple choice learning: Learning to produce multiple structured outputs. In: Advances in Neural Information Processing Systems. (2012) 1799–1807
 [11] Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D.: Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 (2015)
 [12] Rupprecht, C., Laina, I., DiPietro, R., Baust, M., Tombari, F., Navab, N., Hager, G.D.: Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In: International Conference on Computer Vision (ICCV). (2017)
 [13] Ilg, E., Çiçek, Ö., Galesso, S., Klein, A., Makansi, O., Hutter, F., Brox, T.: Uncertainty estimates for optical flow with multihypotheses networks. arXiv preprint arXiv:1802.07095 (2018)

 [14] Chen, C., Kolmogorov, V., Zhu, Y., Metaxas, D., Lampert, C.: Computing the m most probable modes of a graphical model. In: Artificial Intelligence and Statistics. (2013) 161–169
 [15] Batra, D., Yadollahpour, P., GuzmanRivera, A., Shakhnarovich, G.: Diverse mbest solutions in markov random fields. In: European Conference on Computer Vision, Springer (2012) 1–16
 [16] Kirillov, A., Savchynskyy, B., Schlesinger, D., Vetrov, D., Rother, C.: Inferring mbest diverse labelings in a single one. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1814–1822
 [17] Kirillov, A., Shlezinger, D., Vetrov, D.P., Rother, C., Savchynskyy, B.: Mbestdiverse labelings for submodular energies and beyond. In: Advances in Neural Information Processing Systems. (2015) 613–621
 [18] Kirillov, A., Shekhovtsov, A., Rother, C., Savchynskyy, B.: Joint mbestdiverse labelings as a parametric submodular minimization. In: Advances in Neural Information Processing Systems. (2016) 334–342

 [19] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)
 [20] Goodfellow, I.: Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
 [21] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems. (2017) 465–476
 [22] Esser, P., Sutter, E., Ommer, B.: A variational u-net for conditional appearance and shape generation. arXiv preprint arXiv:1804.04694 (2018)
 [23] Bouchacourt, D., Mudigonda, P.K., Nowozin, S.: Disco nets: Dissimilarity coefficients networks. In: Advances in Neural Information Processing Systems. (2016) 352–360
 [24] Rao, C.R.: Diversity and dissimilarity coefficients: a unified approach. Theoretical population biology 21(1) (1982) 24–43
 [25] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-vae: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations. (2017)
 [26] Bellemare, M.G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., Munos, R.: The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017)
 [27] Salimans, T., Zhang, H., Radford, A., Metaxas, D.: Improving gans using optimal transport. arXiv preprint arXiv:1803.05573 (2018)
 [28] Székely, G.J., Rizzo, M.L.: Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference 143(8) (2013) 1249–1272
 [29] Klebanov, L.B., Beneš, V., Saxl, I.: N-distances and their applications. Charles University in Prague, the Karolinum Press (2005)
 [30] Kosub, S.: A note on the triangle inequality for the jaccard distance. arXiv preprint arXiv:1612.02696 (2016)
 [31] Lipkus, A.H.: A proof of the triangle inequality for the tanimoto distance. Journal of Mathematical Chemistry 26(1–3) (1999) 263–265
 [32] Armato, I., Samuel, G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Clarke, L.P.: Data from lidc-idri. the cancer imaging archive. http://doi.org/10.7937/K9/TCIA.2015.LO9QL9SX (2015)
 [33] Armato, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al.: The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38(2) (2011) 915–931
 [34] Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., et al.: The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of digital imaging 26(6) (2013) 1045–1057

 [35] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 3213–3223
 [36] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in Neural Information Processing Systems. (2016) 2234–2242
 [37] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Appendix A Visualization of latent spaces
The segmentation variants from the proposed Probabilistic U-Net correspond to latent space samples from the learned prior distribution. Fig. 6 and Fig. 7 below show samples from the Probabilistic U-Net for an LIDC-IDRI and a Cityscapes example, respectively. The samples are arranged to reflect their corresponding positions in a 2D plane of the respective latent space. This allows us to interpret how the model ends up structuring the space to solve the given tasks.
A.1 Lung Abnormalities Segmentation
In the LIDC-IDRI case, one component of the prior happens to roughly encode lesion size, including a transition to complete lesion absence. The probability mass allocated to absence is relatively small in this particular example, which is arguably in line with the fact that 1 of the 4 graders assessed the image as lesion-free. The other component appears to encode shape variations. During training, the posterior and the prior distribution are tied by means of the KL-divergence. As a consequence they ‘live’ in the same space, and the graders (alongside the image to condition on) can be projected into the same latent space. Fig. 6 shows the graders’ positions as green dots. The three graders that agree on presence map into the 1-sigma interval of the prior, while the grader predicting absence falls just short of the 4-sigma iso-probability contour, in the latent-space area that encodes absence. Fig. 3 gives more LIDC-IDRI examples with their corresponding grader masks and 16 random samples of the Probabilistic U-Net. It appears that our model agrees very well with cases for which there is inter-grader disagreement on lesion presence. For cases where the graders agree on presence, our model at times shows an under-conservative prior, in the sense that the uncertainty on presence can be elevated. The shape variations, however, are covered to a very good degree, as attested by the quantitative experiments above.
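As an illustration of the two quantities used in this paragraph, the following minimal sketch computes the closed-form KL-divergence between two Gaussians and the distance of a projected latent point from the prior mean in units of sigma. It assumes axis-aligned (diagonal) Gaussians; the function names and the numbers in the usage lines are hypothetical and not taken from the experiments.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for two Gaussians with diagonal covariance (assumed form)."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl


def sigma_distance(z, mu_p, sigma_p):
    """Distance of a latent point z from the prior mean, measured in sigmas."""
    return math.sqrt(sum(((zi - m) / s) ** 2 for zi, m, s in zip(z, mu_p, sigma_p)))


# Hypothetical 2D example: a posterior that is shifted and narrower than the prior.
prior_mu, prior_sigma = [0.0, 0.0], [1.0, 1.0]
post_mu, post_sigma = [0.5, -0.2], [0.3, 0.4]
kl = kl_diag_gaussians(post_mu, post_sigma, prior_mu, prior_sigma)
dist = sigma_distance(post_mu, prior_mu, prior_sigma)
```

A point with a sigma distance below 1 lies inside the 1-sigma contour of the prior, which is the sense in which grader positions are compared to the prior here.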
A.2 Street Scene Segmentation
In the Cityscapes task we employ a latent space with more dimensions than in the lung abnormalities task in order to equip the prior with sufficient capacity to encode the grader modes. The best performing model used a 6D latent space; for ease of presentation, however, the following discusses the latent structure of a 3D latent space version. Fig. 7 shows a 2D plane of the latent space onto which we again map corresponding segmentation samples, this time for a Cityscapes example. The precisely defined grader modes in the Cityscapes task can be identified with coherent regions in the latent space. As the space is 3D, not all 32 modes are fully manifest in the shown slice. The locations of the modes are shown via white mode numbers, and the degree of transparency indicates their proximity to the shown slice. As this particular task involves discrete modes, the semantically different regions are coherent and well confined, as hoped for. Inevitably, however, there are transitions between those latent space regions, which translate to mixtures of the grader modes that cross over. Ideally these transitions are as sharp as possible relative to the order of magnitude of the prior variance, which is arguably the case. Fig. 18 shows Cityscapes examples with their corresponding grader masks and 16 random samples of the Probabilistic U-Net. The shown samples exhibit largely coherent variants alongside occasional variant mixtures that correspond to semantic cross-overs in the latent space. As alluded to quantitatively before, the samples also appear to respect the grader variant frequencies, which are captured by structuring the latent space under the prior such that the correct probability mass is allocated to the respective mode. In the upper boundary region of Fig. 7, improper samples are found that show mis-segmentations (although those are unlikely under the prior). The erroneously encoded modes found here are presumably attributable to the presence of inherent ambiguities in the dataset.
Appendix B Metrics
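In the LIDC dataset, given that we have $m$ ground truth samples and $n$ samples from the models, we employ the following statistic:
(6) $\hat{D}^2_{\mathrm{GED}}(P_{\mathrm{gt}}, P_{\mathrm{out}}) = \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} d(S_i, Y_j) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} d(S_i, S_{i'}) - \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} d(Y_j, Y_{j'})$
Here $S_i$ are the model samples, $Y_j$ are the ground truth samples, and $d(x, y) = 1 - \mathrm{IoU}(x, y)$, where $x$ and $y$ are the predicted and ground truth masks of the lesion. In the case that both are empty masks, we define their distance to be $0$, so that the metric rewards the agreement on lesion absence.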
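A minimal sketch of such a sample-based statistic is given below. It assumes the standard estimator of the energy distance (twice the mean cross distance minus the two mean within-set distances) with $1 - \mathrm{IoU}$ as the distance on binary masks; the function names are hypothetical and this is not the evaluation code used for the reported numbers.

```python
import numpy as np


def iou_distance(x, y):
    """1 - IoU of two binary masks; defined as 0 when both masks are empty."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 0.0  # agreement on lesion absence is rewarded
    return 1.0 - np.logical_and(x, y).sum() / union


def ged_squared(samples, ground_truths, dist=iou_distance):
    """Assumed estimator: 2 * E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')]."""
    n, m = len(samples), len(ground_truths)
    cross = sum(dist(s, y) for s in samples for y in ground_truths) / (n * m)
    within_s = sum(dist(s, t) for s in samples for t in samples) / (n * n)
    within_y = sum(dist(y, z) for y in ground_truths for z in ground_truths) / (m * m)
    return 2.0 * cross - within_s - within_y


# Toy 2x2 masks: one with a lesion pixel, one empty.
lesion = [[1, 0], [0, 0]]
empty = [[0, 0], [0, 0]]
matched = ged_squared([lesion, empty], [lesion, empty])
mismatched = ged_squared([lesion, lesion], [empty, empty])
```

When the set of samples coincides with the set of ground truth masks the statistic vanishes, whereas samples that never agree with any grader yield the largest value.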
On the Cityscapes task, given that we have defined the settings, we have full knowledge about the ground truth distribution, which is a mixture of Dirac delta distributions. Hence, we do not need to sample from it, but use it directly in the estimator:
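(7) $\hat{D}^2_{\mathrm{GED}}(P_{\mathrm{gt}}, P_{\mathrm{out}}) = \frac{2}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \omega_j\, d(S_i, Y_j) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} d(S_i, S_{i'}) - \sum_{j=1}^{m}\sum_{j'=1}^{m} \omega_j \omega_{j'}\, d(Y_j, Y_{j'})$
where $S_i$ are the $n$ model samples and $\omega_j$ is the weight of the $j$-th mixture component, which is a delta distribution containing all its density in $Y_j$. Here the distance $d$ depends on the average IoU of the switchable classes only. Predicting such a class when it is not present in the ground truth leads to an IoU score of $0$ for that class, which will be one of the terms over which we average. The computed average does not account for classes that are absent from both the prediction and the ground truth.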
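The following sketch illustrates how a known mixture of modes can enter the estimator directly through its weights instead of through ground truth samples. It is an assumption-laden illustration: the class ids, the helper names, and the choice to return distance 0 when none of the considered classes occurs in either map are hypothetical.

```python
import numpy as np


def class_iou_distance(pred, gt, class_ids):
    """1 - mean IoU over the given classes, skipping classes absent from both maps."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = []
    for c in class_ids:
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from prediction and ground truth: not averaged
        ious.append(np.logical_and(p, g).sum() / union)
    if not ious:
        return 0.0  # assumption: no considered class present anywhere
    return 1.0 - float(np.mean(ious))


def weighted_ged_squared(samples, modes, weights, class_ids):
    """Energy distance estimate against a weighted mixture of known modes."""
    def d(a, b):
        return class_iou_distance(a, b, class_ids)

    n = len(samples)
    pairs = list(zip(modes, weights))
    cross = sum(w * d(s, y) for s in samples for y, w in pairs) / n
    within_s = sum(d(s, t) for s in samples for t in samples) / (n * n)
    within_y = sum(wi * wj * d(yi, yj) for yi, wi in pairs for yj, wj in pairs)
    return 2.0 * cross - within_s - within_y


# Two toy modes that differ in a single switchable class (ids 1 and 2 are made up).
mode_a = [[1, 1], [1, 1]]
mode_b = [[2, 2], [2, 2]]
balanced = weighted_ged_squared([mode_a, mode_b], [mode_a, mode_b], [0.5, 0.5], (1, 2))
collapsed = weighted_ged_squared([mode_a, mode_a], [mode_a, mode_b], [0.5, 0.5], (1, 2))
```

In this toy setting, samples that cover both modes equally give a value of zero, while samples that collapse onto one mode are penalized.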
Appendix C How models fit the ground truth distribution
In this section we analyse the frequency with which each mode of the Cityscapes task is targeted by each model, and how much that deviates from the ground truth distribution. We report the mode-wise and pixel-wise marginal occurrence frequencies of the sampled segmentation variants. In the mode-wise case, each sample is matched to its closest ground truth mode (using $1 - \mathrm{IoU}$ as the distance function). Then, the frequency of each mode is computed by counting the number of samples that most closely match that mode. In the pixel-wise case, the marginal frequencies are obtained by counting all pixels across all images and corresponding samples that show a valid pixel hypothesis given the ground truth, normalized by the number of respective unimodal ground truth pixels. Fig. 8 presents the results for the U-Net Ensemble and the Dropout U-Net, Fig. 9 those for M-Heads and the Image2Image VAE, and Fig. 10 those for our approach.
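The mode-wise matching can be sketched as follows. For brevity the sketch uses binary toy masks rather than multi-class label maps, which is a simplifying assumption; the helper names are hypothetical.

```python
from collections import Counter

import numpy as np


def one_minus_iou(a, b):
    """1 - IoU of two binary masks (0 if both are empty)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return 0.0 if union == 0 else 1.0 - np.logical_and(a, b).sum() / union


def mode_frequencies(samples, modes):
    """Assign every sample to its closest mode and return per-mode frequencies."""
    counts = Counter()
    for s in samples:
        distances = [one_minus_iou(s, m) for m in modes]
        counts[int(np.argmin(distances))] += 1
    return [counts[k] / len(samples) for k in range(len(modes))]


# Toy modes: lesion in the top row vs. lesion in the bottom row.
mode_0 = [[1, 1], [0, 0]]
mode_1 = [[0, 0], [1, 1]]
near_mode_1 = [[0, 0], [1, 0]]
freqs = mode_frequencies([mode_0, mode_0, mode_1, near_mode_1], [mode_0, mode_1])
```

The resulting frequencies can then be compared against the known mode probabilities of the ground truth distribution.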
Appendix D Ablation analysis
In this section we explore variations in the architecture of our approach in order to understand how each design decision affects the performance. We tried three variations of the original approach:
Fixing the prior: Instead of making the prior a function of the context, here we fix it to be a standard Gaussian distribution.
Fixing the prior, and not using the context (input image) in the posterior: In addition to fixing the prior to be a standard Gaussian, we also make the posterior a function of the ground truth mask only, ignoring the context.
Injecting the latent features at the beginning of the U-Net: Starting from our original model, we change the position at which the latent variables are used. Specifically, here we concatenate them to the context (input image) and propagate that through the U-Net.
In Fig. 11 we can observe that our approach performs better than the other variations. As the mechanisms that induce the distributions over segmentations during sampling and training are blinded towards the context image, the performance in terms of the IoU-based energy distance deteriorates. In particular, our model is much better than the variation that injects latent samples at the beginning. This is a pleasant finding, given that our decision to inject the latent variables at the end of the U-Net was motivated by efficiency when sampling. Here we find that we do not lose performance by doing so, but instead observe an improved match of the samples with the ground truth distribution. We hypothesize that injecting the latent variables at the final stage of the pipeline makes it easier for the model to account for different segmentations given the same input. This hypothesis is supported by the slightly better performance of the alternative architecture when sampling only once, and by the fact that this advantage is lost, and actually reversed, when sampling several times.
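The efficiency argument for late injection can be illustrated with a small sketch: the feature map of the segmentation network is computed once per image and each latent sample is merely tiled over the spatial grid and concatenated along the channel axis. The tiling-and-concatenation mechanism, the array layout (height, width, channels), and the function name are assumptions made for illustration.

```python
import numpy as np


def inject_latent(features, z):
    """Tile a latent vector z over the spatial grid and append it as channels.

    features: array of shape (H, W, C); z: array of shape (D,).
    Returns an array of shape (H, W, C + D).
    """
    h, w, _ = features.shape
    z_tiled = np.broadcast_to(z, (h, w, z.shape[0]))
    return np.concatenate([features, z_tiled], axis=-1)


# The (expensive) feature map is computed once; only the cheap injection
# step is repeated for every latent sample.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 3))      # stand-in for final U-Net features
latents = rng.standard_normal((16, 2))         # 16 hypothetical 2D latent samples
combined = [inject_latent(features, z) for z in latents]
```

With injection at the input instead, the full network would have to be evaluated once per latent sample.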
Appendix E Predicting ground truth ambiguity from models’ samples
In this section we assess the capacity of different models trained on LIDC to distinguish between unambiguous and ambiguous instances. Specifically, we define an instance to be ambiguous if 1 or more graders disagree on the presence of abnormal tissue. To do so, for each model we draw samples per instance (as in all other experiments in the paper) and count the number of samples that show a lesion. This lesion presence count is binned into two histograms, one for ambiguous and one for unambiguous instances (shown in Fig. 12). Finally, we evaluate the discriminatory power of these histograms by computing the best threshold that separates ambiguous and unambiguous instances on the validation set. We present the accuracy scores on the test set in Table 1, which shows the advantage that our approach has over the competitors in this regard.
Dropout U-Net  U-Net Ensemble  M-Heads  Image2Image VAE  Probabilistic U-Net 
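The threshold selection described in this appendix can be sketched as below. The decision rule "ambiguous if the lesion presence count is at most the threshold" is an assumption, as are the function names and all numbers in the usage lines.

```python
def accuracy(counts, is_ambiguous, threshold):
    """Accuracy of the rule: predict 'ambiguous' if count <= threshold (assumed rule)."""
    correct = sum((c <= threshold) == a for c, a in zip(counts, is_ambiguous))
    return correct / len(counts)


def best_threshold(counts, is_ambiguous):
    """Pick the threshold with the highest accuracy on a validation set."""
    candidates = range(-1, max(counts) + 1)
    return max(candidates, key=lambda t: accuracy(counts, is_ambiguous, t))


# Hypothetical validation data: lesion presence counts and ambiguity labels.
val_counts = [16, 15, 16, 8, 4, 10]
val_ambiguous = [False, False, False, True, True, True]
threshold = best_threshold(val_counts, val_ambiguous)

# The threshold chosen on the validation set is then applied to held-out data.
test_accuracy = accuracy([16, 3], [False, True], threshold)
```

Selecting the threshold on the validation set and reporting accuracy on the test set keeps the reported score free of selection bias.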

Appendix F Sampling LIDC masks using different models
Figs. 13–17 show samples of our proposed model as well as all the baselines given the same input images. For reference, the expert segmentations are shown in the four rows just below the images. Table 2 shows the numerical results from Fig. 4a.
# Samples  1  4  8  16 

Appendix G Sampling Cityscapes segmentations using our model
Fig. 18 shows samples of our proposed model on the Cityscapes dataset, and Table 3 shows the numerical results from Fig. 4b, so that new approaches can be compared to those.
# Samples  1  4  8  16 

Appendix H Training details
In this section we describe the architecture settings and training procedure for both experiments.
H.1 Lung abnormalities segmentation
We only use those lesions that were specified as a polygon (outline) in the XML files of the LIDC dataset, disregarding the ones that only have a center of shape. That is, following the LIDC paper, we use the lesions that are larger than 3mm and filter out the others, which are clinically less relevant armato2011lung . We also filter out each DICOM file whose absolute value of SliceLocation differs from the absolute value of ImagePositionPatient[1]. Finally, we assume that two masks from different graders correspond to the same lesion if their tightest bounding boxes overlap.
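The bounding-box criterion for matching masks across graders can be sketched as follows. The sketch works on 2D binary masks and treats boxes that share at least one pixel row and column as overlapping; both choices, as well as the function names, are assumptions for illustration.

```python
import numpy as np


def bounding_box(mask):
    """Tightest box (row_min, row_max, col_min, col_max) of a binary mask, or None."""
    mask = np.asarray(mask, dtype=bool)
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:
        return None  # empty mask has no bounding box
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])


def same_lesion(mask_a, mask_b):
    """True if the tightest bounding boxes of the two masks overlap."""
    a, b = bounding_box(mask_a), bounding_box(mask_b)
    if a is None or b is None:
        return False
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


# Three toy masks from hypothetical graders.
m1 = np.zeros((5, 5), dtype=bool)
m1[1:3, 1:3] = True
m2 = np.zeros((5, 5), dtype=bool)
m2[2:4, 2:4] = True
m3 = np.zeros((5, 5), dtype=bool)
m3[0, 0] = True
```

Here `m1` and `m2` would be attributed to the same lesion, while `m3` would be treated as a separate one.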
During training, image-grader pairs are drawn randomly. We apply augmentations to the image tiles ( pixels size): random elastic deformation, rotation, shearing, scaling and a randomly translated crop that results in a tile size of pixels. The U-Net architecture we use is similar to Ronneberger2015 with the exception that we down- and up-sample feature maps by using bilinear interpolations. The cores of all models are identical and feature 4 down- and up-sampling operations; at each scale the blocks comprise three convolutional layers with kernels, each followed by a ReLU activation. In our model, both the prior and the posterior (as well as the posterior in Image2Image VAE) nets have the same architecture as the U-Net’s encoder path, i.e. they are made up of the same number of blocks and type of operations. Their last feature maps are global average pooled and fed into a convolution that predicts the Gaussian distributions parameterized by mean and standard deviation. The architecture’s last layers, corresponding to , comprise the appropriate number of kernels and are activated with a softmax. The base number of channels is 32 and is doubled or respectively halved at each down- or up-sampling transition.
All individual models share this core component and for ease of comparability we let all models undergo the same training schedule: the training proceeds over iterations with an initial learning rate of that is lowered to in 5 steps. All weights of all models are initialized with orthogonal initialization having the gain (multiplicative factor) set to , and the bias terms are initialized by sampling from a truncated normal with . We use a batch-size of 32, weight-decay with weight and optimize using the Adam optimizer with default settings kingma2014adam . A KL weight of with a latent space of dimensions gave best validation results for the baseline Image2Image VAE, and and aD latent space performed well for the Probabilistic U-Net, although the performances were alike across the hyperparameters tried on the validation set.
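The way the prior and posterior nets turn their last feature map into a Gaussian can be sketched as follows. The sketch assumes that a convolution applied to a globally pooled (1 x 1) feature map acts like a linear map, that the net outputs the mean and the logarithm of the standard deviation, and that sampling uses the reparameterization mean + sigma * noise; shapes, names, and the random weights are hypothetical.

```python
import numpy as np


def gaussian_head(feature_map, weight, bias):
    """Global average pooling followed by a linear map to (mean, sigma).

    feature_map: (H, W, C); weight: (C, 2 * D); bias: (2 * D,).
    """
    pooled = feature_map.mean(axis=(0, 1))          # (C,)
    out = pooled @ weight + bias                     # (2 * D,)
    mu, log_sigma = np.split(out, 2)
    return mu, np.exp(log_sigma)                     # exp keeps sigma positive


def sample_latent(mu, sigma, rng):
    """Draw one latent sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8, 32))        # stand-in for encoder output
weight = 0.01 * rng.standard_normal((32, 4))         # maps 32 channels to a 2D Gaussian
bias = np.zeros(4)
mu, sigma = gaussian_head(feature_map, weight, bias)
z = sample_latent(mu, sigma, rng)
```

Predicting the log of the standard deviation is one common way to guarantee positivity; other parameterizations would serve equally well in this sketch.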
H.2 Cityscapes
We downsample the Cityscapes images and label maps to a size of . Similarly to above, we apply random elastic deformation, rotation, shearing, scaling, random translation and additionally impose random color augmentations on the images during training. The U-Net cores in this task are identical to the ones above, but process an additional feature scale (implying one additional up- and one additional down-sampling operation). The training procedure is also equivalent to the previous experiment, also using iterations, except that here we employ a batch-size of 16, and the initial learning rate of is lowered to in 3 steps. The Cityscapes dataset includes ignore label masks for each image, with which we mask the loss during training and the metric during evaluation. A KL weight of and 3D latents gave best validation results for the Image2Image VAE and a and 6D latents performed best for the Probabilistic U-Net (although 3–5D performed similarly).
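Masking the loss with the ignore labels can be sketched as a cross-entropy that is averaged over valid pixels only. The ignore id of 255, the array layout, and the function name are assumptions made for this illustration.

```python
import numpy as np


def masked_cross_entropy(probs, labels, ignore_label=255):
    """Mean cross-entropy over pixels whose label is not the ignore label.

    probs: (H, W, K) per-pixel class probabilities; labels: (H, W) integer ids.
    """
    valid = labels != ignore_label
    safe_labels = np.where(valid, labels, 0)          # any valid index for ignored pixels
    picked = np.take_along_axis(probs, safe_labels[..., None], axis=-1)[..., 0]
    nll = -np.log(np.clip(picked, 1e-12, 1.0))
    return float((nll * valid).sum() / max(valid.sum(), 1))


# Toy example: the bottom row is marked as 'ignore'.
probs = np.full((2, 2, 2), 0.5)
labels = np.array([[0, 1], [255, 255]])
loss = masked_cross_entropy(probs, labels)

# Changing predictions on ignored pixels leaves the loss untouched.
probs_changed = probs.copy()
probs_changed[1] = [0.9, 0.1]
loss_changed = masked_cross_entropy(probs_changed, labels)
```

The same validity mask can be applied to the predicted and ground truth maps before computing the evaluation metric.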