1 Introduction
Recent progress in Convolutional Neural Networks (CNNs) for image segmentation has enabled many possibilities in biomedical tissue analysis using fluorescence microscopy (FM) [1], which images fluorescent dyes (called markers) that target different cells or anatomical structures in biological specimens [2]. Different markers are combined and registered as image channels, which are for our purposes equivalent to modalities in other imaging techniques, such as different sequences in MRI. However, the maximum number of markers in an FM sample is limited due to their spectral overlap, thus requiring different combinations for each biological study. Furthermore, the lack of immunostaining consistency due to sample preparation difficulties, together with limited sample availability, often leads to datasets of images with missing markers, which are challenging for traditional CNNs. To address this, a modality sampling and attention approach named Marker Sampling & Marker Excite (MSME) was proposed in [3], which allows for training and inference with a single model on datasets with heterogeneous marker combinations. Although MSME permits predictions on marker combinations unseen during training, one does not know in advance what segmentation quality to expect for such unseen combinations without any existing labeled data. Such quality estimators would be very valuable not only in predicting potential future performance, but also in deciding where to invest additional imaging and labeling effort.
Uncertainty estimation in deep neural networks with different Bayesian approximations has been shown to improve model predictions, either by explicitly accounting for uncertainty within the loss function or by aggregating predictions from ensembles [4]. Such estimations have been successfully applied to medical image segmentation [5]. Epistemic uncertainty accounts for the lack of confidence in the parameters of a model, i.e. uncertainty that can be reduced with additional labeled data. It can be estimated using so-called Dropout layers [6] both at training and inference time, known as Monte Carlo (MC) dropout [7, 8]. Aleatoric uncertainty captures the inherent noise in the observations, i.e. uncertainty that cannot be reduced with additional data; it can be estimated with the inclusion of a stochastic loss as proposed in [4]. We herein study the use of the above uncertainty estimation tools to design a method for estimating segmentation quality in the above-described setting of heterogeneous FM marker combinations.

2 Methods
As illustrated in Fig. 1, we extend MSME to also estimate aleatoric and epistemic uncertainties, on whose summary statistics we train a regression Random Forest (RF) to predict quantitative segmentation outcomes.
2.1 Learning from images with missing markers
FM images are formed by different combinations of markers $M_i$, with $i \in \{1, \ldots, N_M\}$ and $N_M$ the number of possible markers, represented as channels in an image $x$. We denote a combination by the successive indices of the markers it contains, e.g. $M_{24}$ is a combination of markers 2 and 4. The challenge of different marker combinations in FM images in both training and testing was addressed in [3] using MSME
. Marker Sampling (MS) refers to MC dropout of modalities at training time, which makes testing generalizable to different marker availabilities. Marker Excite (ME) is a feature-wise attention module with 2 fully-connected layers and a one-hot encoded input of the modalities available for the sample, which is added at different layers of a U-Net
[1]. For a detailed network structure and implementation details, see [3].

2.2 Uncertainty estimation in CNN-based segmentation
Different uncertainties are estimated following the frameworks proposed in [4, 5] and included within an MSME model $f$ that is applied on an input image $x$ to predict its corresponding segmentation $\hat{y}$, where $\hat{z} = f(x)$ are the logits resulting from the model: $\hat{y} = \mathrm{softmax}(\hat{z})$.

To estimate the epistemic uncertainty $\sigma_e$ of $\hat{y}$, MC Dropout is employed at different layers of the CNN with probability $p$ both at training and inference. Since the output of the network is then stochastic, $\hat{y}$ and $\sigma_e$ are estimated respectively as the mean and standard deviation (SD) of $T$ samples of $f(x)$. When explicitly stated, we add MC Dropout only after the last layer as proposed in [7].

The aleatoric uncertainty $\sigma_a$ is calculated by explicitly adding a predictive variance $\hat{\sigma}^2$ to the model output, i.e. $[\hat{z}, \hat{\sigma}^2] = f(x)$. This model is trained with a stochastic cross-entropy loss that adds a noise component $\epsilon_t \sim \mathcal{N}(0, I)$ to each of $T$ model predictions, $\hat{z}_t = \hat{z} + \hat{\sigma}\,\epsilon_t$, so that the loss evaluates their mean likelihood:

$\mathcal{L} = -\sum_{i} \log \frac{1}{T} \sum_{t} \exp\Big(\hat{z}_{i,t,c} - \log \sum_{c'} \exp \hat{z}_{i,t,c'}\Big),$

where $i$ indexes pixels and $c$ is the correct class of pixel $i$. At inference, both $\hat{y}$ and $\sigma_a$ can be obtained without sampling.
Both techniques are simultaneously included in a single combined model $f_{ae}$ that is employed to separately estimate both $\sigma_e$ and $\sigma_a$. The number of prediction samples $T$ is kept fixed across our experiments.
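A minimal sketch of the MC-sampling step behind the epistemic estimate, using a toy stochastic model as a stand-in for the dropout-equipped MSME network (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def mc_epistemic(forward, x, T=10):
    """Epistemic uncertainty by MC Dropout: run T stochastic forward passes
    and return the per-pixel mean probability (the prediction) and its SD
    (the epistemic uncertainty map)."""
    probs = np.stack([forward(x) for _ in range(T)])  # (T, H, W)
    return probs.mean(axis=0), probs.std(axis=0)

# Stand-in stochastic "model": fixed hidden activations with Bernoulli
# dropout (here p = 0.5), purely for illustration.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 8))

def toy_forward(x, p=0.5):
    mask = rng.random(hidden.shape) >= p          # units kept by MC dropout
    logits = x + hidden * mask / (1.0 - p)        # inverted-dropout scaling
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid foreground prob

x = rng.normal(size=(8, 8))
y_hat, sigma_e = mc_epistemic(toy_forward, x, T=20)
assert y_hat.shape == sigma_e.shape == (8, 8)
```

Unlike the aleatoric variance head, this estimate requires $T$ forward passes at inference, which is the main computational cost of the epistemic map.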
2.3 Predicting segmentation quality from uncertainty
Different uncertainty measures only provide visual cues about prediction errors. We herein, however, aim to obtain a quantitative estimate of the segmentation quality $q$. To this end, we propose different regression models $g$ that predict $q$ from the uncertainties obtained from the combined model as described above. We train $g$ on uncertainty maps extracted from all possible marker combinations within the validation set of the segmentation task, and compare with the ground-truth quality extracted from their corresponding segmentations. As quality measure, we herein use the $F_1$ score for binary classification, i.e.:

$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$

where $TP$, $FP$, and $FN$ are the numbers of true-positive, false-positive, and false-negative pixels.
We subsequently evaluate $g$ on the test set using the Root Mean Squared Error (RMSE) with respect to the ground-truth $F_1$ across all possible marker combinations. We employ 4-fold cross-validation on the same data split as the segmentation task.
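The regression target and the kind of global summary features described further below can be sketched as follows (the bin edges and moment normalization are assumptions; the toy masks are random):

```python
import numpy as np

def f1_score(pred, gt):
    """Binary F1 (Dice) between a predicted and a ground-truth mask."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2.0 * tp / (2.0 * tp + fp + fn)

def uncertainty_features(u, n_bins=13, lo=0.05, hi=0.65):
    """Global summary features of an uncertainty map: percentiles, a
    cumulative histogram, and the first four statistical moments.
    Evenly spaced bin edges are an assumption of this sketch."""
    pct = np.percentile(u, np.arange(1, 100))              # 99 percentile values
    edges = np.linspace(lo, hi, n_bins)
    chist = np.array([(u <= e).mean() for e in edges])     # cumulative histogram
    m, s = u.mean(), u.std()
    zs = (u - m) / (s + 1e-12)
    moments = np.array([m, s, (zs**3).mean(), (zs**4).mean()])  # mean, SD, skew, kurtosis
    return np.concatenate([pct, chist, moments])

# Toy usage
rng = np.random.default_rng(0)
pred = rng.random((32, 32)) > 0.5
gt = rng.random((32, 32)) > 0.5
q = f1_score(pred, gt)
feats = uncertainty_features(rng.random((32, 32)) * 0.7)
assert 0.0 <= q <= 1.0
assert feats.shape == (99 + 13 + 4,)
```

Such hand-crafted global features deliberately discard spatial layout, which is what keeps the downstream regressor small enough for the limited validation data.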
A first approach tested is to train an additional CNN to predict $q$. We design a simple regression architecture with a very small number of parameters to avoid overfitting to the limited size of the validation set. Denoting 2D convolutional layers with $n$ nodes and $k \times k$ kernels as $C_{n,k}$, 2D max-pooling layers with a $2 \times 2$ kernel as MP, and fully-connected layers with $n$ nodes as $F_n$, we build our regression CNN from a short sequence of such blocks. We add a ReLU activation after every convolutional or fully-connected layer, and use a batch size of 2. We train for 100 epochs with the Adam optimizer [9].

Since the different uncertainty maps may contain error-related information that qualitatively correlates with $q$, we hypothesize that traditional machine-learning models, which have far fewer degrees of freedom and utilize hand-crafted, globally informative features, are potentially more robust against overfitting to the limited data available for this task. To this end, we alternatively train an RF model with 128 trees, using the mean squared error as split criterion. As features we use: percentiles of the uncertainty map (99 values, from the 1st to the 99th), its cumulative histogram (in 13 bins from values 0.05 to 0.65), its first four statistical moments, and a one-hot vector indicating which among all possible marker combinations is used. We compare three approaches:
$RF_e$ with only epistemic features, $RF_a$ with only aleatoric features, and $RF_{ae}$ with both.

3 Results and Discussion
3.1 Dataset and use of markers
We employ the FM dataset of bone marrow vasculature described in [10] with the experimental settings in [2]. The dataset contains 8 samples decomposed into different 2D patches (a total of 230), each with 5 markers. We use the annotated class sinusoids, which amounts to 11.41% of the pixels, the rest being considered background. The samples are divided into 5 for training, 1 for validation, and 2 for testing. Since the segmentation quality vastly differs across marker combinations, we evaluate a relative segmentation $F_1$ score with respect to a reference model on the test set. This score compares the pairwise $F_1$ scores across 4 cross-validation steps and all possible combinations (31) of the 5 available markers. Since it is not feasible to study all possible marker combinations in the training set, 7 training scenarios are proposed. In the first one, all 5 markers are available in all 5 training samples. In the rest, test cases #1 to #6 given in [2], where different markers are artificially ablated in each of the samples, are adopted herein for comparability with the baseline.
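The relative score above can be sketched as follows; the pairing below (matching marker combination and cross-validation fold between model and reference) is an assumption, not necessarily the exact aggregation protocol of [2]:

```python
import numpy as np

def relative_f1(f1_model, f1_reference):
    """Relative segmentation score: pairwise F1 difference between a model
    and a reference, matched per cross-validation fold and per marker
    combination. Inputs have shape (n_folds, n_combinations)."""
    f1_model = np.asarray(f1_model, dtype=float)
    f1_reference = np.asarray(f1_reference, dtype=float)
    diffs = f1_model - f1_reference   # one paired difference per (fold, combination)
    return diffs.mean(), diffs

# Toy usage: 4 folds x 31 combinations, model slightly better than reference
rng = np.random.default_rng(0)
ref = rng.uniform(0.4, 0.9, size=(4, 31))
model = np.clip(ref + rng.normal(scale=0.02, size=ref.shape) + 0.03, 0.0, 1.0)
mean_rel, diffs = relative_f1(model, ref)
assert diffs.shape == (4, 31)
assert mean_rel > 0.0  # the model improves on the reference on average
```

Pairing per fold and combination removes the large quality differences between marker combinations, so only the method effect remains in the comparison.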
3.2 Utilizing uncertainties in segmentation
We herein study how the segmentation $F_1$ score is affected in our missing-marker framework when accounting for the presented uncertainty methods in the MSME architecture. We compare the performance to a reference MSME without uncertainty estimation across all 31 possible marker combinations, 4 cross-validation steps, and 7 missing-marker scenarios. The results in Fig. 2(a) show that for the estimation of $\sigma_e$ it is best to use MC dropout in all convolutional layers (a model we denote $f_e$), which is also superior to the baseline MSME. We assess that the $F_1$ superiority is not merely due to the use of Dropout by comparing it in Fig. 2(b) to an MSME model with standard Dropout layers (no sampling at inference) with the same probability, which is significantly worse than both MSME and $f_e$.
Using the stochastic loss for the estimation of $\sigma_a$ leads to an inferior $F_1$ score, as seen in Fig. 2(c). But, as discussed in the next section, such information is desirable regardless of this negative result. Thus, we employ the combined model $f_{ae}$, which allows for the prediction of both $\sigma_e$ and $\sigma_a$ while providing the best $F_1$ score together with $f_e$.
3.3 Predicting segmentation quality for the unseen
In addition to the superior $F_1$ score achieved with $f_{ae}$, the simultaneous estimation of $\sigma_e$ and $\sigma_a$ allows visual inspection of potential mistakes in the CNN prediction. Here, we use such uncertainty maps to estimate the $F_1$ score of segmented images. We evaluate our proposed methods on the training setting denominated case #6, in which patches have different marker combinations depending on which of the 5 training samples they belong to. This setting contains a variety of markers for each sample that depicts scenarios usually found in practice, and it allows evaluating the predicted $F_1$ for markers not available in any of the training samples.
We show in Fig. 3(a) the RMSE results for the $F_1$ score prediction with the methods presented above. Using the CNN regressor leads to worse RMSE than any of the RF methods, which can be explained by the tendency of CNNs to overfit on such a small training set, and by the simplicity of the task, which calls for the efficient use of predefined explanatory features, where RF methods excel. The RF using both uncertainties ($RF_{ae}$) is superior in RMSE to both single-uncertainty variants ($RF_e$ and $RF_a$). This observation can be attributed to $\sigma_e$ and $\sigma_a$ capturing information about distinct segmentation errors, both of which help in the regression of a more accurate $F_1$ value. Although further improvements could be achieved with deep ensembles, which have been reported to estimate more accurate uncertainties than MC dropout [11], they are computationally very expensive, which would annul a main advantage of our framework: training a single model.
We subsequently employ $RF_{ae}$ to estimate the $F_1$ score for both seen and unseen marker combinations in our training scenario, which is observed in Fig. 3(b) to closely follow the ground truth on average across cross-validation steps. Thus, despite potentially high standard deviations (and high RMSE) for individual patch predictions, we can still make accurate overall predictions about the expected quality of a marker combination, including those that are unseen.
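The per-combination averaging behind this observation can be sketched as follows (toy numbers; the combination labels follow the notation of Sec. 2.1):

```python
import numpy as np

def per_combination_means(pred_q, true_q, combos):
    """Average per-patch quality predictions within each marker combination.
    Individual patches may be noisy, but the per-combination mean is the
    quantity compared against the ground truth."""
    out = {}
    for c in set(combos):
        idx = [i for i, ci in enumerate(combos) if ci == c]
        out[c] = (np.mean([pred_q[i] for i in idx]),
                  np.mean([true_q[i] for i in idx]))
    return out

# Toy usage: noisy per-patch predictions still average close to the truth
rng = np.random.default_rng(0)
combos = ["24", "24", "135", "135", "135"]
true_q = np.array([0.8, 0.8, 0.6, 0.6, 0.6])
pred_q = true_q + rng.normal(scale=0.05, size=true_q.size)
means = per_combination_means(pred_q, true_q, combos)
assert abs(means["24"][0] - means["24"][1]) < 0.1
```

The averaging explains why per-patch RMSE can be high while the per-combination estimate, which is what guides the decision to trust or re-annotate a combination, remains accurate.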
We also visually assess the individual contributions of $\sigma_a$ and $\sigma_e$ in Fig. 4. $\sigma_a$ is observed to focus on the edges of the vessels, where discrepancies in annotations often exist, fitting the definition of aleatoric uncertainty as capturing noise inherent in the observations. Meanwhile, $\sigma_e$ captures mistakenly segmented vessels that appear in the image but are not sinusoids. Such errors are related to a lack of information in the model, and may be corrected by adding labeled training data containing patterns similar to those highlighted by $\sigma_e$.
4 Conclusion
With the proposed framework for the inspection of segmentation uncertainties with missing FM markers, the advantages are twofold: First, accounting for epistemic and aleatoric uncertainties simultaneously produces segmentation results superior to the state-of-the-art MSME model. Second, with an uncertainty-feature-based regressor, we can estimate segmentation quality for any possible marker combination, whether it was seen during training or not. Thus, in practice we can quantitatively evaluate how suitable a trained model is for future samples stained with a previously unseen marker combination, e.g. to discard them or to preferentially annotate them in an active learning framework. Furthermore, the comparison of different uncertainties may indicate whether to focus more on labeling data or on improving image acquisition quality. Note that the proposed methods are also applicable to other multi-modality frameworks, such as MRI.
5 Compliance with Ethical Standards
6 Acknowledgments
Funding was provided by Hasler Foundation, Swiss National Science Foundation (SNSF), Swiss Cancer Research Foundation, and JuliusMüller Foundation.
References
 [1] Thorsten Falk et al., “U-net: deep learning for cell counting, detection, and morphometry,” Nature Methods, vol. 16, no. 1, pp. 67–70, 2019.
 [2] Alvaro Gomariz et al., “Imaging and spatial analysis of hematopoietic stem cell niches,” Annals of the New York Academy of Sciences, vol. 1466, no. 1, pp. 5–16, 2020.
 [3] Alvaro Gomariz et al., “Modality attention and sampling enables deep learning with heterogeneous marker combinations in fluorescence microscopy,” arXiv preprint arXiv:2008.12380, 2020.

 [4] Alex Kendall and Yarin Gal, “What uncertainties do we need in bayesian deep learning for computer vision?,” in NeurIPS, 2017.
 [5] Yuta Hiasa et al., “Automated muscle segmentation from clinical ct using bayesian u-net for personalized musculoskeletal modeling,” IEEE Transactions on Medical Imaging, vol. 39, no. 4, pp. 1030–1040, 2019.
 [6] Nitish Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [7] Yarin Gal and Zoubin Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML, 2016.
 [8] Yarin Gal and Zoubin Ghahramani, “Bayesian convolutional neural networks with bernoulli approximate variational inference,” in ICLR workshop track, 2016.
 [9] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
 [10] Alvaro Gomariz et al., “Quantitative spatial analysis of haematopoiesis-regulating stromal cells in the bone marrow microenvironment by 3d microscopy,” Nature Communications, vol. 9, no. 1, pp. 1–15, 2018.
 [11] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan, “Deep ensembles: A loss landscape perspective,” arXiv preprint, 2020.