Log In Sign Up

Uncertainty Estimation for Heatmap-based Landmark Localization

Automatic anatomical landmark localization has made great strides by leveraging deep learning methods in recent years. The ability to quantify the uncertainty of these predictions is a vital ingredient needed to see these methods adopted in clinical use, where it is imperative that erroneous predictions are caught and corrected. We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds. This framework can be applied to any continuous uncertainty measure, allowing straightforward identification of the best subset of predictions with accompanying estimated error bounds. We facilitate easy comparison between uncertainty measures by constructing two evaluation metrics derived from Quantile Binning. We demonstrate this framework by comparing and contrasting three uncertainty measures (a baseline, the current gold standard, and a proposed method combining aspects of the two), across two datasets (one easy, one hard) and two heatmap-based landmark localization model paradigms (U-Net and patch-based). We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold, and offer recommendations on which uncertainty measure to use and how to use it.


page 1

page 3

page 6

page 8

page 9


Integrating Spatial Configuration into Heatmap Regression Based CNNs for Landmark Localization

In many medical image analysis applications, often only a limited amount...

Modeling Annotation Uncertainty with Gaussian Heatmaps in Landmark Localization

In landmark localization, due to ambiguities in defining their exact pos...

Deep Quantile Aggregation

Conditional quantile estimation is a key statistical learning challenge ...

2D Wasserstein Loss for Robust Facial Landmark Detection

Facial landmark detection is an important preprocessing task for most ap...

Estimation and uncertainty quantification for extreme quantile regions

Estimation of extreme quantile regions, spaces in which future extreme e...

Can a single neuron learn quantiles?

A novel non-parametric quantile estimation method for continuous random ...

1 Introduction

Automatic landmark localization is an important step in many medical image analysis methods, such image as segmentation [1] and image registration [12, 7]. An erroneous landmark prediction at an early stage of analysis will flow downstream and compromise the validity of final conclusions. Therefore, the ability to quantify the uncertainty of a prediction is a vital requirement in a clinical setting where explainability is crucial [8] and there is a human in-the-loop to correct highly uncertain predictions [10].

In this study we propose Quantile Binning, a data-driven framework to estimate a prediction’s quality by learning the relationship between any continuous uncertainty measure and localization error. Using the framework, we place predictions into bins of increasing subject-level uncertainty and assign each bin a pair of estimated localization error bounds. The boundaries of the bins are derived from a trained landmark localization model. These bins can be used to identify the subsets of predictions with expected high or low localization errors, allowing the user to make a choice of which subset of predictions to use based on their expected error bounds. Our method is agnostic to the particular uncertainty metric used, as long as it is continuous and the true function between the uncertainty metric and localization error is monotonically increasing. We showcase our method using three uncertainty measures: a baseline derived from predicted heatmap activations, the gold standard of ensemble model prediction variance

[4], as well as introducing our own measure based on ensemble heatmap activations. Furthermore, we introduce two uncertainty evaluation methods, measuring how well an uncertainty measure truly predicts localization error and the accuracy of our predicted error bounds.

We explore the efficacy of our three uncertainty metrics on two paradigms of localization models: an encoder-decoder U-Net that regresses Gaussian heatmaps [21], and a patch-based network that generates a heatmap from patch voting, PHD-Net [23]. We compare how the same uncertainty measures perform under the two different approaches to landmark localization on two datasets of varying difficulty, finding promising results for both paradigms. Our Quantile Binning method is generalisable to any continuous uncertainty measure, and the examples we investigate in this study can be applied as a post-processing step to any heatmap-based landmark localization method. We aspire that this work can be used as a framework to build, evaluate and compare uncertainty metrics in landmark localization beyond those demonstrated in this paper.

2 Related Work

2.1 Landmark Localization

The recent advancement in machine learning has led to convolutional neural networks (CNNs) dominating the task of landmark localization. Encoder-decoder methods, originally proposed for the task of image segmentation

[21], have cemented themselves as one of the leading approaches for landmark localization in both the medical domain [19, 30, 25]

and computer vision

[24, 27, 2]

. The architecture of these methods allow the analysis of images at multiple resolutions, learning to predict a Gaussian heatmap centred around the predicted landmark location. The activation of each pixel in the heatmap can be interpreted as showing the pseudo-probability of that pixel being the target landmark. The network learns to generate a high response near the landmark, smoothly attenuating the responses in a small radius around it. Regressing heatmaps proves more effective than regressing coordinates

[29], as the heatmap image offers smoother supervision than direct coordinate values, and also models some uncertainty in the prediction.

Figure 1: Overview of our general Quantile Binning framework. a) We make a prediction using a heatmap-based landmark localization model, and b) extract a continuous uncertainty measure. c) We learn thresholds to categorise predictions into bins of increasing uncertainty, estimating error bounds for each bin. d) We filter out predictions from high uncertainty bins to improve the proportion of acceptable predictions. e) Finally, we evaluate each uncertainty measure’s ability to capture the true error quantiles and the accuracy of the estimated error bounds.

However, in medical imaging the number of available training samples is often small so the encoder-decoder network is forced to be shallow, compromising its performance [29]. One method to overcome this is via a patch-based approach; alleviating the problem by sampling many small “patches’ from an image, learning the relationship between each patch and the target landmark [5, 6]. This approach can generate orders of magnitude more training samples from a single image compared to the encoder-decoder style methods. Furthermore, patch-based models that use Fully Convolutional Networks (FCN) have fewer parameters than encoder-decoder architectures, decreasing computational requirements and training times [23].

Noothout et al. [18] implemented a patch-based network using an FCN to jointly perform classification and regression on each patch. The coarse binary classification task determines whether a patch contains the landmark, and the more precise regression task estimates the displacement from the patch to the landmark. This multi-task, joint learning leads to a light-weight network and enhanced localization performance, with the two tasks sharing a feature representation that improves the performance of both [22]. However, the resulting network has a strong local focus and is also susceptible to failure if the predicted patch from the classification task is incorrect. In a follow-up work, Noothout et al. [17] extended their work [18] into a two-stage method: they first train a CNN to provide global estimates for the landmarks, then employ specialised CNNs for each landmark for the final prediction. This method improves upon the first in terms of localization error, but has the drawback of requiring multiple training stages.

To mitigate the inherent local focus of the patch-based methods, we extended the patch-based network [18] by borrowing heatmap regression from the encoder-decoder networks; reforming the binary classification task as a Gaussian heatmap regression task [23]. Named PHD-Net (Patch Heatmap & Displacement), this smoother supervision improved performance, reducing misidentifications compared to using the classification branch from the prior work [18]. Furthermore, we introduced the method Candidate Smoothing, combining the features from the two branches to output more accurate predictions along with an uncertainty measure.

2.2 Uncertainty Estimation

Estimating the uncertainty of machine learning predictions is a topic of growing interest, particularly relevant in the domain of medical imaging where there is often a human in the loop to manually correct flagged predictions. A concentrated effort in uncertainty estimation has been applied to image segmentation by the community, a task similar to landmark localization that instead aims to predict a mask for an entire structure rather than a single point. Some of the most successful approaches use Bayesian approximation methods like Monte-Carlo dropout [16] or an ensemble of networks [14], using the variance of repeated predictions as an indicator for uncertainty. The gold standard approach across many studies is to use an ensemble of networks. This method affords better performance [14] and a more accurate mechanism for Bayesian marginalization [26] compared to a single model using Monte-Carlo dropout. However, the obvious drawback of ensembles is the need to train multiple models.

In landmark localization we are predicting a single point rather than a mask, but similar uncertainty estimation approaches can be utilised. However, there are limited works exploring uncertainty in landmark localization. Payer et al. [20] directly modeled uncertainty during training by learning the Gaussian covariances of target heatmaps, and predicting the distribution of likely locations of the landmark at test time. Lee et al. [13] borrowed from image segmentation approaches by proposing a Bayesian CNN that utilised Monte-Carlo dropout to predict the location and subject-level uncertainty of cephalometric landmarks. Another method to measure the subject-level confidence (the inverse of uncertainty) of a heatmap-based landmark prediction is to measure the maximum heatmap activation (MHA) of the predicted heatmap. Since the activation of a Gaussian heatmap at a particular pixel represents the pseudo-probability of the pixel being the landmark, we can use this pseudo-probability as a confidence measure: the higher the activation, the more certain the prediction. Drevicky et al. [4] compared MHA with ensemble and Monte-Carlo dropout methods, finding MHA surprisingly effective given its simplicity. However, similarly to image segmentation, they found using an ensemble of models was best at predicting uncertainty. They calculated the coordinate prediction variance between an ensemble of models, and found this method performed best at estimating prediction uncertainty.

In our earlier work utilising the patch-based model PHD-Net, MHA was also used as the uncertainty metric [23]. However, the heatmap analysed is distinctly different from the heatmaps predicted by encoder-decoder networks. Rather than explicitly learning a Gaussian function centred around the landmark, the approach combined patch-wise heatmap and displacement predictions. We produced a new non-Gaussian heatmap, where the activation of each pixel is defined by the number of patches that voted for it, regularised by the coarse global likelihood prediction. Therefore, the resulting heatmap represents patch-wise ensemble votes rather than a Gaussian function, where the MHA is the pixel with the most “patch votes”.

To the best of our knowledge, no study has investigated how heatmap-based uncertainty estimation measures can be used to filter out poor predictions in landmark localization. Furthermore, no general framework has been proposed to compare how well uncertainty measures can predict localization error - an important practical application in clinical settings.

3 Contributions

In this paper, we propose a general framework to compare and evaluate uncertainty measures for landmark localization. This work extends the analysis of MHA in [23], with more in depth experiments and comparisons. Our contributions, depicted in Fig. 1, are threefold:

  • We propose an ensemble-based method to extract landmark coordinates from an ensemble of heatmaps and estimate prediction uncertainty: Ensemble Maximum Heatmap Activation (E-MHA) (Fig. 1a, 1b, Sec. 4.2).

  • We propose Quantile Binning, a method to categorise predictions by any continuous uncertainty measure, and estimate error bounds for each bin (Fig. 1c, Sec. 4.3).

  • We construct two evaluation metrics for uncertainty estimation methods from Quantile Binning: 1) Similarity between predicted bins and true error quantiles; 2) Accuracy of estimated error bounds. (Fig. 1e, Sec. 4.4).

We demonstrate the impact of our contributions by using our proposed Quantile Binning to compare E-MHA to two existing coordinate extraction and uncertainty estimation methods: a baseline of Single Maximum Heatmap Activation (S-MHA), and a gold standard of Ensemble Coordinate Prediction Variance (E-CPV). In Sec. 6.2, we compare the baseline coordinate extraction performance of the three approaches, followed by the uncertainty estimation performance in Sec. 6.3. We explore the reach of heatmap-based uncertainty measures by demonstrating they are applicable to both U-Net regressed Gaussian heatmaps and patch-based voting heatmaps. We show each uncertainty measure can identify a subset of predictions with significantly lower mean error than the full set by filtering out predictions from high uncertainty bins (Fig. 1c). Finally, in Sec. 7.2 we make recommendations for which uncertainty measure to use, and how to use it.

We provide an open source implementation of this work along with the data to reproduce our results at

4 Methods

4.1 Landmark Localization Models

First, we briefly review the two models we use for landmark localization, allowing us to compare the generalisability of our uncertainty measures across different heatmap generation approaches. We implement a variation of the popular encoder-decoder networks that regresses Gaussian heatmaps, U-Net [21]. We also implement a patch-based method, PHD-Net [23], which produces a heatmap from patch votes.

4.1.1 Encoder-Decoder Model (U-Net)

The vast majority of state-of-the-art landmark localization approaches are based on the foundation of a U-Net style encoder-decoder architecture. The architecture of U-Net follows a “U” shape, first extracting features at several downsampled resolutions, before rebuilding to the original dimensionality in a symmetrical upsampling path. Skip connections are employed between each level, preserving spatial information. The rationale behind the architecture design is to inject some inductive bias into the model architecture itself, helping it learn the local characteristics of each landmark, while preserving the global context.

Rather than regressing coordinates directly, the objective of the model is to learn a Gaussian heatmap image for each landmark, with the centre of the heatmap on the target landmark. For each landmark with 2D coordinate position , the 2D heatmap image is defined as the 2D Gaussian function:


The network learns weights and biases to predict the heatmap . During inference, we can interpret the activation of each pixel in the predicted heatmap as the pseudo-probability of that pixel being the landmark. We will exploit this in our uncertainty estimation methods.

4.1.2 Patch-based Model (PHD-Net)

Patch-based models use a Fully Convolutional Network (FCN), with the architecture resembling the first half of an encoder-decoder architecture. Therefore, they are more light-weight than encoder-decoder networks, with significantly less parameters leading to faster training.

In our earlier work, we proposed PHD-Net: a multi-task patch-based network [23], building on the work by Noothout et al. [18]. We incorporate a variant of the heatmap objective function from encoder-decoder networks into the objective function, predicting the 2D displacement from each patch to the landmark alongside the coarse Gaussian pseudo-probability of each patch.

PHD-Net aggregates the patch-wise predictions to obtain a heatmap by plotting candidate predictions from the displacement branch as small Gaussian blobs, then regularising the map by the upsampled Gaussian from the heatmap branch.

Again, we can consider the activation of each pixel in heatmap as an indicator for uncertainty, where instead of the psuedo-probability, the activation represents the amount of “patch votes”.

4.1.3 Ensemble Models

Using an ensemble of models is more robust than using a single model, as it reduces the effect of a single model becoming stuck in a local minima during training. Furthermore, we can use the variance in the predictions of each model to estimate the uncertainty of the prediction [4]. We use an ensemble of models where each model is identical, except the random initialisation of parameters.

4.2 Estimating Uncertainty and Coordinate Extraction

Although generated differently, we hypothesise both U-Net and PHD-Net produce heatmaps containing useful information to quantify a prediction’s uncertainty - but are they equally effective? To this end, we compare the performance of both models under three uncertainty estimation methods: a baseline approach, our proposed approach extending the baseline to an ensemble of networks, and the gold standard approach. Each method extracts coordinate values from the predicted heatmap, and estimates the prediction’s uncertainty.

4.2.1 Single Maximum Heatmap Activation (S-MHA)

We introduce the baseline coordinate extraction and uncertainty measure. We use the standard method to obtain the predicted landmark’s coordinates from the predicted heatmap , by finding the pixel with the highest activation:


We hypothesise that the pixel activation at the coordinates can describe the model’s uncertainty: the higher the activation, the lower the uncertainty, and the lower the prediction error. However, due to this inverse relationship, this measures “confidence”, not uncertainty.

We transform our confidence metric to an “uncertainty” metric , by applying the following transformation to the pixel activation at the predicted landmark location:


where is a small constant scalar that prevents . Now, as the pixel activation at increases, decreases.

We call the transformed activation of this peak pixel Single Maximum Heatmap Activation (S-MHA). This is a continuous value bounded between for U-Net, and bounded between for PHD-Net, where is the number of patches. The lower the S-MHA, the lower the uncertainty.

4.2.2 Ensemble Maximum Heatmap Activation (E-MHA)

In this work we extend the S-MHA uncertainty measure to ensemble models. We hypothesise that E-MHA should hold a stronger correlation with error than S-MHA due to the additional robustness an ensemble of models affords. We generate the mean heatmap of the models in the ensemble, and obtain the predicted landmark coordinates as the pixel with the highest activation:


Again, we hypothesise the activation of this pixel correlates with model confidence. Similar to S-MHA, we inverse the pixel activation and add a small to the activation of to give us our uncertainty measure, :


E-MHA is a continuous value constrained to the same bounds as S-MHA. This is a form of late feature fusion, combining features from all models before a decision is made.

4.2.3 Ensemble Coordinate Prediction Variance (E-CPV)

We also implement the gold standard of uncertainty estimation: ensemble coordinate prediction variance [4]. The more the models disagree on where the landmark is, the higher the uncertainty.

To extract a landmark’s coordinates we first use the traditional S-MHA coordinate extraction method on each of the models’ predicted heatmaps. Then, we use decision-level fusion to calculate the mean coordinate of the individual predictions to compute the final coordinate predictions :


We generate the Ensemble Coordinate Prediction Variance (E-CPV) by calculating the mean absolute difference between the model predictions and :


This is a continuous value bounded between , where and are the height and width of the original image, respectively. The more the models disagree on the landmark location, the higher the coordinate prediction variance, and the higher the uncertainty.

Unlike S-MHA and E-MHA, this metric completely ignores the value of the heatmap activations, potentially losing some useful uncertainty information by opting for fusion at the decision-level.

4.3 Quantile Binning: Categorising Predictions by Uncertainty and Estimating Error Bounds

We leverage the described uncertainty measures to inform the subject-level uncertainty of any given prediction, i.e. is the model’s prediction likely to be accurate, or inaccurate based on this uncertainty value? We propose a data-driven approach, Quantile Binning, using a hold-out validation set to establish thresholds delineating varying levels of uncertainty specific to each trained model. We use these learned thresholds to categorise our predictions into bins and estimate error bounds for each bin. We opt for a data-driven approach compared to a rule-based approach for two reasons: 1) When models are constrained to a limited dataset (in the order of hundreds of training samples), they can have difficulty converging well, particularly in the task of localizing a difficult landmark. 2) If the limited dataset contains high variance, different training sets can lead to confoundingly different learned weights under the same model architecture and training schedule. Therefore, establishing a set of thresholds for each model is more invariant to training differences compared to using the same thresholds for all models.

Quantile Binning is application agnostic; applicable to any data as long as it consists of continuous tuples of .

In this paper, we generate these pairings after the landmark localization model is trained. We use a hold-out validation set and make coordinate predictions and uncertainty estimates using each of our three uncertainty measures described in Section 4.2. Since we have the ground truth annotations of the validation set we can produce continuous tuples for each uncertainty measure.

4.3.1 Establishing Quantile Thresholds

We aim to categorise predictions using our continuous uncertainty metrics into bins. We make the following assumption: The true function between a good uncertainty measure and localization error is monotonically increasing (i.e. the higher the uncertainty, the higher the error).

Quantile binning is a non-parametric method that fits well with these assumptions - a variant of histogram binning which is commonly used for calibration of predictive models [9, 15]

. By considering the data in quantiles rather than intervals, we can better capture a skewed distribution as the outliers in the tail of the distribution can be grouped into the same group. In other words,

quantiles divide the probability distribution into areas of approximately equal probability.

This property allows us to interrogate model-specific (epistemic) uncertainties. Rather than compute uncertainty thresholds based on predefined error thresholds for each bin, we use Quantile Binning to create thresholds that group our samples in relative terms. This enables the user to flag the worst of predictions. We describe the steps below.

First, for any given uncertainty measure we sort our validation set tuples in ascending order of their uncertainty value and sequentially group them into equal-sized bins . We assign each bin a pair of boundaries defined by the uncertainty values at the edges of the bin to create an interval: . To capture all predictions at the tail ends of the distribution, we set , and .

During inference, we use these boundaries to bin our predictions into bins (), with uncertainty increasing with each bin. For each predicted landmark with uncertainty where , is binned into . The distribution of samples should be uniform across the bins, due to the quantile method we used to obtain thresholds.

The higher the value of , the more fine-grained we can categorise our uncertainty estimates. However, as increases the method becomes more sensitive to any miscalibration of the uncertainty measure, leading to less accurate prediction binnings.

This method is agnostic of the scale used in the uncertainty measure, and therefore is applicable to all our defined uncertainty measures.

4.3.2 Estimating Error Bounds using Isotonic Regression

Establishing thresholds has allowed us to filter predictions by uncertainty in relative terms, but we lack a method to estimate absolute localization error for each bin. For example, for an easy landmark, the samples in may have a very low localization error in absolute terms, but for a more difficult landmark even the lowest relative uncertainty samples in may have a high error. Therefore, in order to offer users information about the expected error for each group, we present a data-driven approach to predict error bounds.

First, we estimate the true function between the uncertainty measure and localization error using our hold-out validation set. However, since the validation tuples represent a small random sample of the true distribution, plotting this relationship may be noisy. Therefore, to make the best approximation of our assumed true function, we use isotonic regression to fit a monotonically increasing line between uncertainty and localization error, using our validation tuples. Isotonic regression is a method to fit a free-form, non-decreasing line to a set of observations, also commonly used for predictive model calibration [28, 9]. It is non-parametric, so can learn the true distribution if given enough i.i.d. data. The regression seeks a weighted least squares fit subject to the constraint that :


where and is the number of pairs. In our case, the observations , are the tuples.

Next, we use our isotonically regressed line to estimate error bounds for each of our quantile bins. Since we are interested in error bounds that only increase with uncertainty, we discard the potentially noisy observed tuples from our validation set and instead predict error bounds from the uncertainty values using the fitted function. For each bin’s threshold interval [, ), we estimate the expected error interval: [. We use these values as the estimated lower and upper error bounds of the predictions in bin . Note, that for we only estimate an upper bound, and for we only estimate a lower bound.

In summary, we use a data-driven approach to learn thresholds to progressively filter predictions at inference into bins of increasing uncertainty, and assign each bin estimated error bounds.

4.4 Evaluation Metrics for Uncertainty Measures

In this subsection we construct methods to evaluate how well an uncertainty measure’s predicted bins represent the true error quantiles, and how accurate each bin’s estimated error bounds are.

4.4.1 Evaluating the Predicted Bins

A good uncertainty measure will have a strong correlation with localization error. Therefore, it should provide quantile thresholds that correspond to the true error quantiles. For example, since Bin contains the predictions with the uncertainties at the lowest quantile, the localization errors of the predictions in should be the lowest quantile of the test set. This can be generalised to each group, until , which should contain the errors in the quantile.

To evaluate this desired property, we propose to measure the similarity between each predicted bin and its respective theoretically perfect bin.

We create the ground truth (GT) bins by ordering the test set samples in ascending order of error. Then, we sequentially bin them into equally sized bins: .

For each predicted and GT bin pair &

, we calculate the Jaccard Index (JI) between them and report the mean measure of each bin across all folds:


The higher the JI, the better the uncertainty measure has binned predictions by localization error. Therefore, it follows that the higher the JI, the better the uncertainty measure predicts localization error.

4.4.2 Accuracy of Estimated Error bounds

A good uncertainty measure will have a monotonically increasing relationship with localization error. Therefore, estimating the true function using isotonic regression should provide accurate error bound estimations.

To measure this, for each bin , we calculate the percentage of predictions whose error falls between the estimated error bound interval, . The higher the percentage, the higher the accuracy of our estimated upper error bounds.

(a) 4CH landmarks.
(b) SA landmarks.
Figure 2: (a) Landmarks for 4 chamber (4CH) CMR: Magenta = tricuspid valve; Yellow = mitral valve; Red = apex of left ventricle. (b) Landmarks for Short Axis (SA) CMR: Magenta = superior right ventricle insertion point valve; Yellow = inferior right ventricle insertion point; Red = inferior lateral reflection of right ventricle free wall.

5 Datasets

We perform our experiments using a dataset from the ASPIRE Registry [11], with Cardiac Magnetic Resonance Imaging (CMR) sequences containing a mix of subjects suffering from pulmonary arterial hypertension (PAH) and no pulmonary hypertension (PH). Each subject has a four chamber (4CH) view and/or a short axis view (SA). Each CMR sequence has a spatial resolution of pixels, where each pixel represents 0.9375mm of the organ, and 20 frames (we use only the first frame for landmark localization in this study). There are 303 SA images, each with three annotated landmarks: the inferior right ventricle insertion point (infSA), the superior right ventricle insertion point (supSA), and the inferior lateral reflection of the right ventricle free wall (RVSA). There are 422 4CH images, each with three annotated landmarks: the apex of the left ventricle at end diastole (LVDEV Apex), the mitral valve (mitral), and tricuspid valve (tricuspid). The 4CH dataset represents a more challenging landmark localization task as the images have much higher variability than the SA dataset. The landmarks were decided and manually labelled by a radiologist, as shown in Fig. 2. For this study, we consider the SA images the EASY dataset, and the 4CH images the HARD dataset.

6 Experiments and Results

First, in Sec. 6.2 we present the baseline landmark localization performance of PHD-Net and U-Net over both SA and 4CH datasets using the S-MHA, E-CPV, and E-MHA methods to extract coordinates. This gives us a comparison of the coordinate extraction performance from each of our methods, and a baseline to measure the effectiveness of each method’s uncertainty estimation. Second, in Sec. 6.3 we interrogate how using Quantile Binning with our uncertainty measures delineates predictions in terms of their localization error, and compare the predicted bins to the ground truth error quantiles. We show a practical example of how filtering out highly uncertain predictions can dramatically increase the proportion of acceptable localization predictions. Finally, in Sec. 6.4 we assess how well the uncertainty measures can predict error bounds for each bin. When comparing between we use an unpaired -test () to test for significance. When comparing uncertainty metrics among the same Bin category and model, we use a paired -test () to test for significance.

6.1 Experimental Setup

4 Chamber Images Short Axis Images
Method U-Net PHD-Net U-Net PHD-Net
S-MHA All 10.00 18.99 11.07 21.33 5.86 14.19 3.58 3.52
S-MHA 6.79 6.09 5.80 9.03 3.62 2.45 2.78 1.99
E-MHA All 6.36 8.01 9.14 18.11 4.37 8.86 3.36 3.50
E-MHA 4.93 2.85 4.70 3.21 2.98 2.09 2.39 1.90
E-CPV All 8.13 10.16 9.42 13.07 4.97 7.51 3.22 2.93
E-CPV 5.34 3.00 5.10 6.76 3.75 2.13 2.47 2.08
Table 1: Localization errors (mm) for the uncertainty methods outlined. All indicates entire set of predictions; indicates subset with the lowest uncertainties

. Mean error and standard deviation are reported across all folds & all landmarks.

Bold indicates best results in row for the given dataset.

We split both datasets into 8 folds, and perform 8-fold cross validation for both U-Net and PHD-Net. For each iteration, we select one fold as our testing set, one our hold-out validation set and the remaining 6 as our training set. We select for the ensemble methods, training 5 separate models at each fold. We chose to compromise with computational constraints, asserting that 5 models are representative to compare the uncertainty methods for our purposes. Each model is identical apart from randomly initialised weights. We randomly select a model for our S-MHA uncertainty measure. For our Quantile Binning method, we select for 5 bins, balancing the constraints of our small datasets with a useful number of uncertainty categories.

We implement our U-Net model [21] using the MONAI package [3]. We design the model with 5 encoding-decoding levels, creating 1.63M learnable parameters. We modify the objective function from image segmentation to simultaneous landmark localization, minimising the mean squared error between the target and predicted heatmaps. We use the full pixel image as input, and learn heatmaps of the same size. We generate target heatmaps using Eq. (1

) with a standard deviation of 8, and train for 1000 epochs with a batch size of 2, and a learning rate of 0.001 using the Adam Optimiser (settings from


We implement our PHD-Net model following [23], creating a model with 0.06M learnable parameters. For all experiments we trained PHD-Net for 1000 epochs using a batch size of 32 and a learning rate of 0.001, using the Adam Optimiser. We train one landmark at a time. Note, the only difference in setup from [23] in this work is different fold splits and training for an additional 500 epochs (same as U-Net) with no early stopping, since we now use our validation set for Quantile Binning.

6.2 Baseline Landmark Localization Performance

Figure 3 and the All rows in Table 1 show the baseline performance for U-Net and PHD-Net at localizing landmarks in our 4CH and SA datasets. We make the following observations:

  • When considering the entire set of landmarks (All), performance is better on the SA dataset for both models, with PHD-Net outperforming U-Net. On the 4CH dataset, U-Net outperforms PHD-Net, suggesting the higher capacity model of U-Net is more robust to datasets with large variations.

  • Simply using a single model with our S-MHA strategy is predictably less robust than ensemble approaches.

  • E-MHA outperforms the previous gold standard of E-CPV for coordinate extraction. However, does it outperform E-CPV in terms of uncertainty estimation? We explore this in Section 6.3.

  • The standard deviation in the error for the baseline All results in Table 1 is high for all models. We aspire to catch these bad predictions using quantile binning in Sec. 6.3.

(a) SA dataset.
(b) 4CH dataset.
Figure 3: Cumulative distribution of localization errors across the entire set of predictions, showing the % of predictions under a given error threshold. The vertical line is the acceptable error threshold, chosen by a radiologist. It shows results for all folds & landmarks using the uncertainty methods’ coordinate extraction. Higher percentage is better.
(a) Localization error for each Bin - 4CH dataset (Lower is better).
(b) Localization error for each Bin - SA dataset (Lower is better).
(c) Jaccard Index for each Bin - 4CH dataset (Higher is better).
(d) Jaccard Index for each Bin - SA dataset (Higher is better).
(e) Estimated Error Bound Accuracies- 4CH dataset (Higher is better).
(f) Estimated Error Bound Accuracies- SA dataset (Higher is better).
Figure 4: Results from Quantile Binning for U-Net and PHD-Net across all landmarks & folds, using our three coordinate extraction & uncertainty estimation methods. Bins are in descending order of uncertainty ( highest uncertainty, lowest uncertainty). (a) and (b) show the mean localization error of each bin, with error decreasing as we move towards the bins with lower uncertainty. (c) and (d) present the Jaccard Index, showing how similar the predicted bins are to the ground truth error quantiles. (e) and (f) visualise the estimated error bound accuracy, showing the percentage of predictions within the estimated error bounds for each bin.

6.3 Analysis of the Predicted Quantile Bins

We apply quantile binning to each uncertainty measure: S-MHA, E-MHA and E-CPV. We compare results over U-Net and PHD-Net for both the SA and 4CH datasets.

We found the most useful information is at the tail ends of the uncertainty distributions. Figs. 3(c) & 3(d) plot the Jaccard Index between ground truth error quantiles and predicted error quantiles. The predictions in the highest uncertainty quantile () is significantly better at capturing the correct subset of predictions than the intermediate bins (). Similarly, in most cases the bin representing the lowest uncertainties () had a significantly higher Jaccard Index than the intermediate bins, but still lower than . Figs. 3(a) & 3(b) show the mean error of the samples of each quantile bin over both datasets. The most significant reduction in localization error is from to for all uncertainty measures, further indicating our uncertainty measures are well calibrated to filter out gross mispredictions. These findings suggest that most of the utility in the uncertainty measures investigated can be found at the tail ends of the scale. This is an intuitive finding, as the predictions in are certainly uncertain, and the predictions in are certainly certain.

The worse trained the landmark localization model, the more useful the uncertainty measure. Table 1 shows the localization error of all methods, models and datasets for the entire set (All) and lowest uncertainty subset () of predictions. PHD-Net’s baseline localization performance on the 4CH dataset was worse than U-Net. However, when we consider the lowest uncertainty subset of predictions (), PHD-Net sees a 47% average reduction in error from all predictions (All), compared to U-Net’s average reduction of 30%. Similarly, U-Net performed worse than PHD-Net for the SA dataset, but saw an average error reduction of 31% compared to PHD-Net’s 25%. This suggests that all investigated uncertainty measures are more effective at identifying gross mispredictions when models are poorly trained.

Using heatmap-based uncertainty measures is generalisable across heatmap generation approaches. The bin similarities in Figs. 3(d) & 3(c) show that using S-MHA and E-MHA yields similar performance with PHD-Net and U-Net, despite their different heatmap derivations. Surprisingly using E-MHA does not give a significant increase in bin similarity compared to S-MHA, suggesting the thresholds remain relatively stable across models.

No investigated method is conclusively best for estimating uncertainty in all scenarios. For the more challenging 4CH data, Fig. 3(c) shows E-CPV is significantly better than S-MHA and E-MHA for both models at capturing the true error quantiles, corroborating the findings of [4]. E-CPV is particularly good at identifying the worst predictions (). For the easier SA data, no method has a significantly higher Jaccard Index. Therefore, when we generalise across both models and datasets, all uncertainty measures fared broadly similar on average in terms of error reduction between the entire set and the subset of predictions. S-MHA had an average error reduction of 35.07%, E-MHA 32.94% and E-CPV 32%.

Despite similar performances in uncertainty estimation, we found E-MHA yields the greatest localization performance overall. Table 1 shows E-MHA offers the best localization performance for across both datasets and models. This is due to the combination of offering the most robust coordinate extraction on average (Fig. 3), and similar uncertainty estimation performance (Fig. 3(c), Fig. 3(d)). We more concretely demonstrate Quantile Binning’s ability to identify low uncertainty predictions in Fig. 5. We clearly observe a significant increase in the percentage of images below the acceptable error threshold of 5mm when considering only predictions in - with E-MHA giving the greatest proportion of acceptable predictions. For both datasets we observe that PHD-Net using E-MHA has a higher proportion of acceptable predictions compared to U-Net. If we consult Table 1, we also observe that PHD-Net’s predictions in bin indeed have lower mean localization error than U-Net’s corresponding bin for both datasets, with E-MHA offering the lowest average localization error.

(a) PHD-Net - 4CH Images.
(b) U-Net - 4CH Images.
(c) PHD-Net - SA Images.
(d) U-Net - SA Images.
Figure 5: Cumulative distribution of localization errors showing the % of predictions under a given error threshold, comparing all predictions (All) to the lowest uncertainty subset () for the uncertainty methods across all folds & landmarks. The vertical line is the acceptable error threshold, chosen by a radiologist. Higher percentage is better.

6.4 Analysis of Error Bound Estimation

We analyse how accurate the isotonically regressed estimated error bounds are for our quantile bins. Figs. 3(e) & 3(f) show the percentage of samples in each bin that fall between the estimated error bounds.

We found we can predict the error bounds for the two extreme bins better than the intermediate bins. Figs. 3(e) & 3(f) show a similar pattern to the Jaccard Index Figs. 3(c) & 3(d), with the two extreme bins and predicting error bounds significantly more accurately than the inner bins. Again, this indicates the most useful uncertainty information is present at the extremes of the uncertainty distribution, with the predicted uncertainty-error function unable to capture a consistent relationship for the inner quantiles. This finding is intuitive, as it is easier to put a lower error bound on the most uncertain quantile of predictions or upper bound on the most certain predictions, than it is to assign tighter error bounds on middling uncertainty values.

We also found that a well defined upper bound for heatmap activations is important for error bound estimates. For both the 4CH and SA datasets, S-MHA for PHD-Net is significantly more accurate at predicting error bounds for the highest uncertainty quantile compared to the lowest uncertainty quantile (56% & 72% compared to 30% & 27% for 4CH & SA, respectively), correlating with S-MHA capturing a greater proportion of those bins (Jaccard Indexes of 32% & 24% compared to 16% & 15%). On the other hand, U-Net using S-MHA predicts error bounds for low uncertainty bins better than high uncertainty bins. This suggests that although PHD-Net’s heatmap activation is a robust predictor of gross mispredictions, the less tight upper bound of its heatmap activations make it hard to make an accurate prediction for the lowest uncertainty quantile (). This is alleviated by using an ensemble of networks in E-MHA, where the bound accuracy is improved to 62%.

E-MHA and E-CPV are more consistent than S-MHA. Overall, there is no significant difference between the error bound estimation accuracy of E-MHA and S-MHA, but Figs. 3(e) & 3(f) show E-MHA has less variation in performance between U-Net and PHD-Net compared to S-MHA, suggesting an ensemble of models is more robust. For the 4CH dataset, PHD-Net using E-CPV is on average significantly more accurate at predicting error bounds than S-MHA and E-MHA. However, there are no significant differences for PHD-Net on the easier SA dataset, nor U-Net on either dataset. There are also no significant differences between U-Net and PHD-Net in error bound estimation accuracy, with each method broadly equally effective for both models.

7 Discussion and Conclusion

7.1 Summary of Findings

This paper presented a general framework to assess any continuous uncertainty measure in landmark localization, demonstrating its use on three uncertainty metrics and two paradigms of landmark localization model. We introduced a new coordinate extraction and uncertainty estimation method, E-MHA, offering the best baseline localization performance and competitive uncertainty estimation.

Our experiments indicate that both heatmap-based uncertainty metrics (S-MHA, E-MHA), as well as the gold standard coordinate variance uncertainty metric (E-CPV) are applicable to both U-Net and PHD-Net. Despite the two models’ distinctly different approaches to generating heatmaps, using the maximum heatmap activation as an indicator for uncertainty is effective for both models. We showed that all investigated uncertainty metrics were effective at filtering out the gross mispredictions () and identifying the 20% most certain predictions (), but could not capture useful information for the intermediate uncertainty bins (-).

Our experiments also showed that E-MHA and S-MHA had a surprisingly similar ability to capture the true error quantiles of the best and worst 20% of predictions (Figs. 3(c) & 3(d)), but E-MHA was more consistent with its performance predicting the error bounds of those bins across models (Figs. 3(e) & 3(f)). This suggests that the calibration of the head and tail ends of the heatmap distributions is stable across our ensemble of models, but susceptible to variance when fitting our isotonically regressed line to predict error bounds. On the more challenging 4CH dataset, E-CPV broadly remained the gold standard for filtering out the worst predictions, but this trend did not continue in the easier SA dataset (Fig 5).

In terms of error bound estimation, we found a strong correlation with the bin’s Jaccard Index to the true quantiles. Bins and could offer good error bound estimates, but the intermediate bins could not (Figs. 3(e) & 3(f)). We found all uncertainty methods performed broadly the same: effective at predicting error bounds for and , but poor at predicting error bounds for -. The one exception was PHD-Net using S-MHA, which could not accurately predict error bounds for due to the high variance in pixel activations of highly certain predictions.

Overall, we found that using E-MHA provided the most robust coordinate extraction method of all methods, showing the best baseline localization error. When considering only the predictions with the lowest uncertainty (), using E-MHA achieves the lowest mean localization error accross all models and datasets.

7.2 Recommendations

In terms of utility, when resources are available to train an ensemble of models, we recommend to use E-MHA as the coordinate extraction and uncertainty estimation method. E-MHA offers the best baseline localization performance with a sufficient ability to filter out the gross mispredictions () and identify the most certain predictions (). It can also sufficiently estimate error bounds for these two bins. However, between these thresholds the uncertainty metric is poorly calibrated to bin predictions in a robust way, or accurately predict error bounds. If resources are constrained, S-MHA is surprisingly effective at capturing the true error quantiles for bins and , but note that when using a patch-based voting heatmap that is not strictly bounded, the error bound estimation for is not robust.

7.3 Conclusion

Beyond the above recommendations, we hope the framework described in this paper can be used to assess refined or novel uncertainty metrics for landmark localization, and act as a baseline for future work. Furthermore, we have shown that both the voting derived heatmap of PHD-Net, and the regressed Gaussian heatmap of U-Net can be exploited for uncertainty estimation. In this paper, we only explored the activation of the peak pixel, but it is likely that more informative measures can be extracted from the broader structure of the heatmap, promising greater potential for uncertainty estimation in landmark localization waiting to be uncovered.


  • [1] R. Beichel, H. Bischof, F. Leberl, and M. Sonka (2005) Robust active appearance models and their application to IEEE trans. med. imag.. IEEE Trans. Med. Imag. 24 (9), pp. 1151–1169. Cited by: §1.
  • [2] A. Bulat and G. Tzimiropoulos (2018)

    Super-FAN: integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANS


    n Proc. IEEE Conf. Comput. Vis. Pattern Recognit

    pp. 109–117. Cited by: §2.1.
  • [3] Project MONAI Note: Available at External Links: Document, Link Cited by: §6.1.
  • [4] D. Drevickỳ and O. Kodym (2020) Evaluating deep learning uncertainty measures in cephalometric landmark localization.. In BIOIMAGING, pp. 213–220. Cited by: §1, §2.2, §4.1.3, §4.2.3, §6.3.
  • [5] O. Emad, I. A. Yassine, and A. S. Fahmy (2015) Automatic localization of the left ventricle in cardiac MRI images using deep learning. In Proc. EMBC, pp. 683–686. Cited by: §2.1.
  • [6] L. et al (2018) Fast multiple landmark localisation using a patch-based iterative network. In Proc. MICCAI, pp. 563–571. Cited by: §2.1.
  • [7] M. et al (2011) Semi-automatic construction of reference standards for evaluation of image registration. Medical Image Analysis 15 (1), pp. 71–84. Cited by: §1.
  • [8] D. Gunning, M. Stefik, J. Choi, T. Miller, S. Stumpf, and G. Yang (2019)

    XAI-Explainable artificial intelligence

    Science robotics 4 (37) (eng). External Links: ISSN 2470-9476 Cited by: §1.
  • [9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017)

    On calibration of modern neural networks

    In International Conference on Machine Learning, pp. 1321–1330. Cited by: §4.3.1, §4.3.2.
  • [10] A. Holzinger (2016) Interactive machine learning for health informatics: When do we need the human-in-the-loop?. Brain Informatics 3 (2), pp. 119–131. Cited by: §1.
  • [11] Hurdman et al. (2012) ASPIRE registry: assessing the spectrum of pulmonary hypertension identified at a REferral centre. European Respiratory Journal 39 (4), pp. 945–955. Cited by: §5.
  • [12] H. J. Johnson and G. E. Christensen (2002) Consistent landmark and intensity-based image registration. IEEE Trans. Med. Imag. 21 (5), pp. 450–461. Cited by: §1.
  • [13] J. Lee, H. Yu, M. Kim, J. Kim, and J. Choi (2020) Automated cephalometric landmark detection with confidence regions using bayesian convolutional neural networks. BMC Oral Health 20 (1), pp. 1–10. Cited by: §2.2.
  • [14] A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur (2020) Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Trans. Med. Imag. 39 (12), pp. 3868–3878. Cited by: §2.2.
  • [15] M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using bayesian binning. Twenty-Ninth AAAI Conference on Artificial Intelligence 2015, pp. 2901–2907. Cited by: §4.3.1.
  • [16] T. Nair, D. Precup, D. L. Arnold, and T. Arbel (2020) Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. IEEE Trans. Med. Imag. 59, pp. 101557. Cited by: §2.2.
  • [17] J. M. H. Noothout, B. D. De Vos, J. M. Wolterink, E. M. Postma, P. A. M. Smeets, R. A. P. Takx, T. Leiner, M. A. Viergever, and I. Išgum (2020) Deep learning-based regression and classification for automatic landmark localization in medical images. IEEE Trans. Med. Imag. 39 (12), pp. 4011–4022. External Links: Document Cited by: §2.1.
  • [18] Noothout et al. (2018) CNN-based landmark detection in cardiac CTA scans. In MIDL Amsterdam, pp. 1–11. Cited by: §2.1, §2.1, §4.1.2.
  • [19] C. Payer, D. Štern, H. Bischof, and M. Urschler (2019) Integrating spatial configuration into heatmap regression based CNNs for landmark localization. IEEE Trans. Med. Imag. 54, pp. 207–219. Cited by: §2.1.
  • [20] C. Payer, M. Urschler, H. Bischof, and D. Štern (2020) Uncertainty estimation in landmark localization based on gaussian heatmaps. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis, pp. 42–51. Cited by: §2.2.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. MICCAI, pp. 234–241. Cited by: §1, §2.1, §4.1, §6.1.
  • [22] S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §2.1.
  • [23] L. Schobs, S. Zhou, M. Cogliano, A. J. Swift, and H. Lu (2021) Confidence-quantifying landmark localisation for cardiac MRI. In IEEE Int. Symp. Biomed. Imag., pp. 985–988. Cited by: §1, §2.1, §2.1, §2.2, §3, §4.1.2, §4.1, §6.1, §6.1.
  • [24] Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas (2018) Quantized densely connected U-Nets for efficient landmark localization. In Proc. ECCV, pp. 339–354. Cited by: §2.1.
  • [25] N. Torosdagli, D. K. Liberton, P. Verma, M. Sincan, J. S. Lee, and U. Bagci (2018) Deep geodesic learning for segmentation and anatomical landmarking. IEEE Trans. Med. Imag. 38 (4), pp. 919–931. Cited by: §2.1.
  • [26] A. G. Wilson and P. Izmailov (2020) Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791. Cited by: §2.2.
  • [27] J. Yang, Q. Liu, and K. Zhang (2017) Stacked hourglass network for robust facial landmark localisation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pp. 79–87. Cited by: §2.1.
  • [28] B. Zadrozny and C. Elkan (2002)

    Transforming classifier scores into accurate multiclass probability estimates

    In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. Cited by: §4.3.2.
  • [29] J. Zhang, M. Liu, and D. Shen (2017) Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks. IEEE Trans. on Imag. Processing 26 (10), pp. 4753–4764. Cited by: §2.1, §2.1.
  • [30] Z. Zhong, J. Li, Z. Zhang, Z. Jiao, and X. Gao (2019) An attention-guided deep regression model for landmark detection in cephalograms. In Proc. MICCAI, pp. 540–548. Cited by: §2.1.