1 Introduction
Automatic landmark localization is an important step in many medical image analysis methods, such as image segmentation [1] and image registration [12, 7]. An erroneous landmark prediction at an early stage of analysis will flow downstream and compromise the validity of final conclusions. Therefore, the ability to quantify the uncertainty of a prediction is a vital requirement in a clinical setting, where explainability is crucial [8] and there is a human in the loop to correct highly uncertain predictions [10].
In this study we propose Quantile Binning, a data-driven framework to estimate a prediction’s quality by learning the relationship between any continuous uncertainty measure and localization error. Using the framework, we place predictions into bins of increasing subject-level uncertainty and assign each bin a pair of estimated localization error bounds. The boundaries of the bins are derived from a trained landmark localization model. These bins can be used to identify the subsets of predictions with expected high or low localization errors, allowing the user to choose which subset of predictions to use based on their expected error bounds. Our method is agnostic to the particular uncertainty metric used, as long as it is continuous and the true function between the uncertainty metric and localization error is monotonically increasing. We showcase our method using three uncertainty measures: a baseline derived from predicted heatmap activations, the gold standard of ensemble model prediction variance
[4], as well as introducing our own measure based on ensemble heatmap activations. Furthermore, we introduce two uncertainty evaluation methods, measuring how well an uncertainty measure truly predicts localization error and the accuracy of our predicted error bounds. We explore the efficacy of our three uncertainty metrics on two paradigms of localization models: an encoder-decoder UNet that regresses Gaussian heatmaps [21], and a patch-based network that generates a heatmap from patch voting, PHDNet [23]. We compare how the same uncertainty measures perform under the two different approaches to landmark localization on two datasets of varying difficulty, finding promising results for both paradigms. Our Quantile Binning method is generalisable to any continuous uncertainty measure, and the examples we investigate in this study can be applied as a post-processing step to any heatmap-based landmark localization method. We hope this work can be used as a framework to build, evaluate and compare uncertainty metrics in landmark localization beyond those demonstrated in this paper.
2 Related Work
2.1 Landmark Localization
The recent advancement of machine learning has led to convolutional neural networks (CNNs) dominating the task of landmark localization. Encoder-decoder methods, originally proposed for the task of image segmentation
[21], have cemented themselves as one of the leading approaches for landmark localization in both the medical domain [19, 30, 25] and computer vision
[24, 27, 2]. The architecture of these methods allows the analysis of images at multiple resolutions, learning to predict a Gaussian heatmap centred around the predicted landmark location. The activation of each pixel in the heatmap can be interpreted as the pseudo-probability of that pixel being the target landmark. The network learns to generate a high response near the landmark, smoothly attenuating the responses in a small radius around it. Regressing heatmaps proves more effective than regressing coordinates
[29], as the heatmap image offers smoother supervision than direct coordinate values, and also models some uncertainty in the prediction. However, in medical imaging the number of available training samples is often small, so the encoder-decoder network is forced to be shallow, compromising its performance [29]. One method to overcome this is a patch-based approach, alleviating the problem by sampling many small “patches” from an image and learning the relationship between each patch and the target landmark [5, 6]. This approach can generate orders of magnitude more training samples from a single image compared to encoder-decoder style methods. Furthermore, patch-based models that use Fully Convolutional Networks (FCNs) have fewer parameters than encoder-decoder architectures, decreasing computational requirements and training times [23].
Noothout et al. [18] implemented a patch-based network using an FCN to jointly perform classification and regression on each patch. The coarse binary classification task determines whether a patch contains the landmark, and the more precise regression task estimates the displacement from the patch to the landmark. This multi-task, joint learning leads to a lightweight network and enhanced localization performance, with the two tasks sharing a feature representation that improves the performance of both [22]. However, the resulting network has a strong local focus and is also susceptible to failure if the predicted patch from the classification task is incorrect. In a follow-up work, Noothout et al. [17] extended their method [18] into a two-stage approach: they first train a CNN to provide global estimates for the landmarks, then employ specialised CNNs for each landmark for the final prediction. This method improves upon the first in terms of localization error, but has the drawback of requiring multiple training stages.
To mitigate the inherent local focus of patch-based methods, we extended the patch-based network of [18] by borrowing heatmap regression from encoder-decoder networks, reforming the binary classification task as a Gaussian heatmap regression task [23]. Named PHDNet (Patch Heatmap & Displacement), this smoother supervision improved performance, reducing misidentifications compared to using the classification branch from the prior work [18]. Furthermore, we introduced Candidate Smoothing, combining the features from the two branches to output more accurate predictions along with an uncertainty measure.
2.2 Uncertainty Estimation
Estimating the uncertainty of machine learning predictions is a topic of growing interest, particularly relevant in the domain of medical imaging where there is often a human in the loop to manually correct flagged predictions. The community has concentrated much of its uncertainty estimation effort on image segmentation, a task similar to landmark localization that instead aims to predict a mask for an entire structure rather than a single point. Some of the most successful approaches use Bayesian approximation methods such as Monte Carlo dropout [16] or an ensemble of networks [14], using the variance of repeated predictions as an indicator of uncertainty. The gold standard approach across many studies is to use an ensemble of networks. This method affords better performance [14] and a more accurate mechanism for Bayesian marginalization [26] compared to a single model using Monte Carlo dropout. However, the obvious drawback of ensembles is the need to train multiple models.
In landmark localization we are predicting a single point rather than a mask, but similar uncertainty estimation approaches can be utilised. However, there are limited works exploring uncertainty in landmark localization. Payer et al. [20] directly modeled uncertainty during training by learning the Gaussian covariances of target heatmaps, and predicting the distribution of likely locations of the landmark at test time. Lee et al. [13] borrowed from image segmentation approaches by proposing a Bayesian CNN that utilised Monte Carlo dropout to predict the location and subject-level uncertainty of cephalometric landmarks. Another method to measure the subject-level confidence (the inverse of uncertainty) of a heatmap-based landmark prediction is to measure the maximum heatmap activation (MHA) of the predicted heatmap. Since the activation of a Gaussian heatmap at a particular pixel represents the pseudo-probability of that pixel being the landmark, we can use this pseudo-probability as a confidence measure: the higher the activation, the more certain the prediction. Drevicky et al. [4] compared MHA with ensemble and Monte Carlo dropout methods, finding MHA surprisingly effective given its simplicity. However, similarly to image segmentation, they found that the coordinate prediction variance across an ensemble of models performed best at estimating prediction uncertainty.
In our earlier work utilising the patch-based model PHDNet, MHA was also used as the uncertainty metric [23]. However, the heatmap analysed is distinctly different from the heatmaps predicted by encoder-decoder networks. Rather than explicitly learning a Gaussian function centred around the landmark, the approach combined patch-wise heatmap and displacement predictions. We produced a new non-Gaussian heatmap, where the activation of each pixel is defined by the number of patches that voted for it, regularised by the coarse global likelihood prediction. Therefore, the resulting heatmap represents patch-wise ensemble votes rather than a Gaussian function, where the MHA is the pixel with the most “patch votes”.
To the best of our knowledge, no study has investigated how heatmap-based uncertainty estimation measures can be used to filter out poor predictions in landmark localization. Furthermore, no general framework has been proposed to compare how well uncertainty measures can predict localization error, an important practical application in clinical settings.
3 Contributions
In this paper, we propose a general framework to compare and evaluate uncertainty measures for landmark localization. This work extends the analysis of MHA in [23] with more in-depth experiments and comparisons. Our contributions, depicted in Fig. 1, are threefold:
We demonstrate the impact of our contributions by using our proposed Quantile Binning to compare EMHA to two existing coordinate extraction and uncertainty estimation methods: a baseline of Single Maximum Heatmap Activation (SMHA), and a gold standard of Ensemble Coordinate Prediction Variance (ECPV). In Sec. 6.2, we compare the baseline coordinate extraction performance of the three approaches, followed by the uncertainty estimation performance in Sec. 6.3. We explore the reach of heatmapbased uncertainty measures by demonstrating they are applicable to both UNet regressed Gaussian heatmaps and patchbased voting heatmaps. We show each uncertainty measure can identify a subset of predictions with significantly lower mean error than the full set by filtering out predictions from high uncertainty bins (Fig. 1c). Finally, in Sec. 7.2 we make recommendations for which uncertainty measure to use, and how to use it.
We provide an open source implementation of this work along with the data to reproduce our results at
https://github.com/pykale/pykale/tree/main/examples/landmark_uncertainty.
4 Methods
4.1 Landmark Localization Models
First, we briefly review the two models we use for landmark localization, allowing us to compare the generalisability of our uncertainty measures across different heatmap generation approaches. We implement a variation of the popular encoderdecoder networks that regresses Gaussian heatmaps, UNet [21]. We also implement a patchbased method, PHDNet [23], which produces a heatmap from patch votes.
4.1.1 EncoderDecoder Model (UNet)
The vast majority of state-of-the-art landmark localization approaches are built on the foundation of a UNet-style encoder-decoder architecture. The architecture of UNet follows a “U” shape, first extracting features at several downsampled resolutions before rebuilding to the original dimensionality in a symmetrical upsampling path. Skip connections are employed between each level, preserving spatial information. The rationale behind the architecture design is to inject some inductive bias into the model architecture itself, helping it learn the local characteristics of each landmark while preserving the global context.
Rather than regressing coordinates directly, the objective of the model is to learn a Gaussian heatmap image for each landmark, with the centre of the heatmap on the target landmark. For each landmark with 2D coordinate position $\mu = (\mu_x, \mu_y)$, the 2D heatmap image $g$ is defined as the 2D Gaussian function:

$g(x, y) = \exp\left(-\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2}\right),$ (1)

where $\sigma$ controls the spread of the Gaussian.
The network learns weights and biases to predict the heatmap $\hat{g}$. During inference, we can interpret the activation of each pixel in the predicted heatmap as the pseudo-probability of that pixel being the landmark. We will exploit this in our uncertainty estimation methods.
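To make the target heatmap concrete, it can be generated in a few lines of NumPy. This is a minimal sketch assuming the common unnormalised form with peak activation 1; the function name and image size are illustrative, not from the original implementation.

```python
import numpy as np

def gaussian_heatmap(height, width, landmark, sigma):
    """Target heatmap for one landmark: a 2D Gaussian centred on the
    landmark (x0, y0), with peak 1.0 decaying smoothly with distance."""
    x0, y0 = landmark
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

# The argmax of the target heatmap recovers the landmark pixel.
target = gaussian_heatmap(64, 64, landmark=(40, 20), sigma=8.0)
peak_row, peak_col = np.unravel_index(np.argmax(target), target.shape)
```

The network is then trained to regress such an image, so the same argmax rule used on the target can later extract coordinates from the prediction.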
4.1.2 Patchbased Model (PHDNet)
Patch-based models use a Fully Convolutional Network (FCN), with an architecture resembling the first half of an encoder-decoder architecture. Therefore, they are more lightweight than encoder-decoder networks, with significantly fewer parameters leading to faster training.
In our earlier work, we proposed PHDNet: a multi-task patch-based network [23], building on the work by Noothout et al. [18]. We incorporate a variant of the heatmap objective function from encoder-decoder networks, predicting the 2D displacement from each patch to the landmark alongside the coarse Gaussian pseudo-probability of each patch.
PHDNet aggregates the patch-wise predictions to obtain a heatmap by plotting candidate predictions from the displacement branch as small Gaussian blobs, then regularising the map by the upsampled Gaussian from the heatmap branch.
Again, we can consider the activation of each pixel in the heatmap as an indicator of uncertainty, where instead of the pseudo-probability, the activation represents the number of “patch votes”.
4.1.3 Ensemble Models
Using an ensemble of models is more robust than using a single model, as it reduces the effect of a single model becoming stuck in a local minimum during training. Furthermore, we can use the variance in the predictions of the individual models to estimate the uncertainty of the prediction [4]. We use an ensemble of $M$ models, where each model is identical except for the random initialisation of its parameters.
4.2 Estimating Uncertainty and Coordinate Extraction
Although generated differently, we hypothesise that both UNet and PHDNet produce heatmaps containing useful information to quantify a prediction’s uncertainty, but are they equally effective? To this end, we compare the performance of both models under three uncertainty estimation methods: a baseline approach, our proposed approach extending the baseline to an ensemble of networks, and the gold standard approach. Each method extracts coordinate values from the predicted heatmap and estimates the prediction’s uncertainty.
4.2.1 Single Maximum Heatmap Activation (SMHA)
We introduce the baseline coordinate extraction and uncertainty measure. We use the standard method to obtain the predicted landmark’s coordinates $(\hat{x}, \hat{y})$ from the predicted heatmap $\hat{g}$, by finding the pixel with the highest activation:

$(\hat{x}, \hat{y}) = \operatorname*{arg\,max}_{(x, y)} \hat{g}(x, y).$ (2)
We hypothesise that the pixel activation at the coordinates $(\hat{x}, \hat{y})$ can describe the model’s uncertainty: the higher the activation, the lower the uncertainty, and the lower the prediction error. However, due to this inverse relationship, this measures “confidence”, not uncertainty.
We transform our confidence metric to an “uncertainty” metric $u$, by applying the following transformation to the pixel activation at the predicted landmark location:

$u = \frac{1}{\hat{g}(\hat{x}, \hat{y}) + \epsilon},$ (3)

where $\epsilon$ is a small constant scalar that prevents division by zero. Now, as the pixel activation at $(\hat{x}, \hat{y})$ increases, $u$ decreases.
We call the transformed activation of this peak pixel Single Maximum Heatmap Activation (SMHA). This is a continuous value, bounded below by $\frac{1}{1+\epsilon}$ for UNet (whose heatmap activations lie in $(0, 1]$) and by $\frac{1}{N+\epsilon}$ for PHDNet, where $N$ is the number of patches, and bounded above by $\frac{1}{\epsilon}$ in both cases. The lower the SMHA, the lower the uncertainty.
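SMHA extraction therefore reduces to an argmax over the heatmap followed by the inverse transformation of Eq. (3). A minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

EPSILON = 1e-6  # small constant preventing division by zero

def smha(heatmap, eps=EPSILON):
    """Coordinate extraction + uncertainty for one predicted heatmap:
    take the peak pixel, then invert its activation so that a confident
    (high-activation) peak maps to a low uncertainty value."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    activation = heatmap[row, col]
    return (col, row), 1.0 / (activation + eps)  # ((x, y), uncertainty)

# A sharp, confident peak yields lower uncertainty than a weak one.
confident = np.zeros((8, 8)); confident[3, 5] = 0.95
hesitant = np.zeros((8, 8)); hesitant[3, 5] = 0.20
(coords, u_conf), (_, u_hes) = smha(confident), smha(hesitant)
```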
4.2.2 Ensemble Maximum Heatmap Activation (EMHA)
In this work we extend the SMHA uncertainty measure to ensemble models. We hypothesise that EMHA should hold a stronger correlation with error than SMHA due to the additional robustness an ensemble of models affords. We generate the mean heatmap $\bar{g} = \frac{1}{M}\sum_{m=1}^{M} \hat{g}_m$ of the $M$ models in the ensemble, and obtain the predicted landmark coordinates as the pixel with the highest activation:

$(\hat{x}, \hat{y}) = \operatorname*{arg\,max}_{(x, y)} \bar{g}(x, y).$ (4)
Again, we hypothesise the activation of this pixel correlates with model confidence. Similar to SMHA, we invert the pixel activation and add a small $\epsilon$ to the activation of $\bar{g}$ at $(\hat{x}, \hat{y})$ to give us our uncertainty measure:

$u = \frac{1}{\bar{g}(\hat{x}, \hat{y}) + \epsilon}.$ (5)
EMHA is a continuous value constrained to the same bounds as SMHA. This is a form of late feature fusion, combining features from all models before a decision is made.
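This late-fusion rule can be sketched directly: average the ensemble heatmaps, then apply the same argmax and inverse-activation step as in the single-model case (names are illustrative; the heatmaps are assumed to share one shape).

```python
import numpy as np

def emha(heatmaps, eps=1e-6):
    """Late fusion: average the M ensemble heatmaps into one mean map,
    then take its peak pixel and invert the peak activation."""
    mean_map = np.mean(np.stack(heatmaps), axis=0)
    row, col = np.unravel_index(np.argmax(mean_map), mean_map.shape)
    return (col, row), 1.0 / (mean_map[row, col] + eps)

# Three ensemble members agreeing on pixel (x=5, y=3), varying confidence.
members = []
for activation in (0.9, 0.7, 0.8):
    h = np.zeros((8, 8)); h[3, 5] = activation
    members.append(h)
coords, uncertainty = emha(members)
```

A member that places its peak elsewhere simply dilutes the mean activation at the consensus pixel, raising the resulting uncertainty.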
4.2.3 Ensemble Coordinate Prediction Variance (ECPV)
We also implement the gold standard of uncertainty estimation: ensemble coordinate prediction variance [4]. The more the models disagree on where the landmark is, the higher the uncertainty.
To extract a landmark’s coordinates we first use the traditional SMHA coordinate extraction method on each of the $M$ models’ predicted heatmaps, yielding individual predictions $\hat{p}_1, \ldots, \hat{p}_M$. Then, we use decision-level fusion, taking the mean coordinate of the individual predictions to compute the final coordinate prediction $\hat{p}$:

$\hat{p} = \frac{1}{M}\sum_{m=1}^{M} \hat{p}_m.$ (6)
We generate the Ensemble Coordinate Prediction Variance (ECPV) by calculating the mean absolute difference between the model predictions and $\hat{p}$:

$u = \frac{1}{M}\sum_{m=1}^{M} \left\lVert \hat{p}_m - \hat{p} \right\rVert.$ (7)
This is a continuous value bounded between $0$ and $\sqrt{H^2 + W^2}$, where $H$ and $W$ are the height and width of the original image, respectively. The more the models disagree on the landmark location, the higher the coordinate prediction variance, and the higher the uncertainty.
Unlike SMHA and EMHA, this metric completely ignores the value of the heatmap activations, potentially losing some useful uncertainty information by opting for fusion at the decision level.
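The decision-level fusion of Eqs. (6) and (7) amounts to a few lines of NumPy. A sketch with illustrative names:

```python
import numpy as np

def ecpv(predictions):
    """Decision-level fusion: average the M models' (x, y) predictions,
    and use the mean distance of each prediction from that average as
    the uncertainty. Agreement -> low spread -> low uncertainty."""
    predictions = np.asarray(predictions, dtype=float)  # shape (M, 2)
    mean_coord = predictions.mean(axis=0)
    spread = np.linalg.norm(predictions - mean_coord, axis=1).mean()
    return mean_coord, spread

# Four models whose predictions straddle the point (1, 0).
coords, uncertainty = ecpv([(0, 0), (2, 0), (1, 1), (1, -1)])
```

Note that the heatmap activations never enter this computation, only the extracted coordinates.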
4.3 Quantile Binning: Categorising Predictions by Uncertainty and Estimating Error Bounds
We leverage the described uncertainty measures to inform the subject-level uncertainty of any given prediction, i.e. is the model’s prediction likely to be accurate or inaccurate given this uncertainty value? We propose a data-driven approach, Quantile Binning, using a holdout validation set to establish thresholds delineating varying levels of uncertainty specific to each trained model. We use these learned thresholds to categorise our predictions into bins and estimate error bounds for each bin. We opt for a data-driven approach over a rule-based approach for two reasons: 1) When models are constrained to a limited dataset (in the order of hundreds of training samples), they can have difficulty converging well, particularly in the task of localizing a difficult landmark. 2) If the limited dataset contains high variance, different training sets can lead to confoundingly different learned weights under the same model architecture and training schedule. Therefore, establishing a set of thresholds for each model is more invariant to training differences than using the same thresholds for all models.
Quantile Binning is application agnostic: it is applicable to any data consisting of continuous (uncertainty, error) tuples.
In this paper, we generate these pairings after the landmark localization model is trained. We use a holdout validation set and make coordinate predictions and uncertainty estimates using each of our three uncertainty measures described in Section 4.2. Since we have the ground truth annotations of the validation set, we can produce continuous (uncertainty, error) tuples for each uncertainty measure.
4.3.1 Establishing Quantile Thresholds
We aim to categorise predictions, using our continuous uncertainty metrics, into $Q$ bins. We make the following assumption: the true function between a good uncertainty measure and localization error is monotonically increasing (i.e. the higher the uncertainty, the higher the error).
Quantile binning is a non-parametric method that fits well with this assumption: it is a variant of histogram binning, which is commonly used for calibration of predictive models [9, 15]. By considering the data in quantiles rather than fixed-width intervals, we can better capture a skewed distribution, as the outliers in the tail of the distribution are grouped together. In other words, quantiles divide the probability distribution into areas of approximately equal probability.
This property allows us to interrogate model-specific (epistemic) uncertainties. Rather than compute uncertainty thresholds based on predefined error thresholds for each bin, we use Quantile Binning to create thresholds that group our samples in relative terms. This enables the user to flag the worst fraction of predictions. We describe the steps below.
First, for any given uncertainty measure we sort our validation set tuples in ascending order of their uncertainty value and sequentially group them into $Q$ equal-sized bins $B_1, \ldots, B_Q$. We assign each bin $B_i$ a pair of boundaries defined by the uncertainty values at the edges of the bin, creating an interval $[u_{\text{low}}^{(i)}, u_{\text{high}}^{(i)})$. To capture all predictions at the tail ends of the distribution, we set $u_{\text{low}}^{(1)} = -\infty$ and $u_{\text{high}}^{(Q)} = \infty$.
During inference, we use these boundaries to bin our predictions into $Q$ bins ($B_1, \ldots, B_Q$), with uncertainty increasing with each bin: each predicted landmark with uncertainty $u$ satisfying $u_{\text{low}}^{(i)} \le u < u_{\text{high}}^{(i)}$ is binned into $B_i$. The distribution of samples should be approximately uniform across the bins, due to the quantile method we used to obtain thresholds.
The higher the value of $Q$, the more fine-grained our categorisation of uncertainty estimates. However, as $Q$ increases the method becomes more sensitive to any miscalibration of the uncertainty measure, leading to less accurate prediction binnings.
This method is agnostic to the scale used by the uncertainty measure, and is therefore applicable to all our defined uncertainty measures.
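The threshold-learning and binning steps above can be sketched with NumPy quantiles. This is an illustrative implementation (names are our own), following the convention that the outer boundaries are open so every test-time uncertainty lands in some bin:

```python
import numpy as np

def fit_quantile_thresholds(val_uncertainties, q=5):
    """Learn Q-1 interior thresholds from the validation uncertainties and
    pad with -inf/+inf so the Q bins cover the whole real line."""
    probs = np.linspace(0.0, 1.0, q + 1)[1:-1]          # 1/Q, 2/Q, ...
    interior = np.quantile(val_uncertainties, probs)
    return np.concatenate(([-np.inf], interior, [np.inf]))

def assign_bin(uncertainty, thresholds):
    """Return the 1-indexed bin (1 = lowest uncertainty, Q = highest)."""
    return int(np.searchsorted(thresholds, uncertainty, side="right"))

# Uniformly spread validation uncertainties split into 5 equal-sized bins.
val = np.arange(100.0)
thresholds = fit_quantile_thresholds(val, q=5)
counts = [sum(assign_bin(u, thresholds) == b for u in val) for b in range(1, 6)]
```

At inference, `assign_bin` places any uncertainty value into exactly one of the $Q$ bins, with extreme values falling in the first or last bin.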
4.3.2 Estimating Error Bounds using Isotonic Regression
Establishing thresholds has allowed us to filter predictions by uncertainty in relative terms, but we lack a method to estimate the absolute localization error for each bin. For example, for an easy landmark, the samples in $B_1$ may have a very low localization error in absolute terms, but for a more difficult landmark even the lowest relative uncertainty samples in $B_1$ may have a high error. Therefore, in order to offer users information about the expected error for each group, we present a data-driven approach to predict error bounds.
First, we estimate the true function between the uncertainty measure and localization error using our holdout validation set. However, since the validation tuples represent a small random sample of the true distribution, plotting this relationship may be noisy. Therefore, to make the best approximation of our assumed true function, we use isotonic regression to fit a monotonically increasing line between uncertainty and localization error using our validation tuples. Isotonic regression is a method to fit a free-form, non-decreasing line to a set of observations, also commonly used for predictive model calibration [28, 9]. It is non-parametric, so it can learn the true distribution given enough i.i.d. data. The regression seeks a weighted least squares fit $\hat{y}_i$ subject to a monotonicity constraint:

$\min \sum_{i=1}^{N} w_i (\hat{y}_i - y_i)^2 \quad \text{subject to} \quad \hat{y}_i \le \hat{y}_j \text{ whenever } x_i \le x_j,$ (8)

where $w_i$ are observation weights and $N$ is the number of pairs. In our case, the observations $(x_i, y_i)$ are the (uncertainty, error) tuples.
Next, we use our isotonically regressed line $f$ to estimate error bounds for each of our quantile bins. Since we are interested in error bounds that only increase with uncertainty, we discard the potentially noisy observed tuples from our validation set and instead predict error bounds from the uncertainty threshold values using the fitted function. For each bin’s threshold interval $[u_{\text{low}}^{(i)}, u_{\text{high}}^{(i)})$, we estimate the expected error interval $[f(u_{\text{low}}^{(i)}), f(u_{\text{high}}^{(i)}))$. We use these values as the estimated lower and upper error bounds of the predictions in bin $B_i$. Note that for $B_1$ we only estimate an upper bound, and for $B_Q$ we only estimate a lower bound.
In summary, we use a datadriven approach to learn thresholds to progressively filter predictions at inference into bins of increasing uncertainty, and assign each bin estimated error bounds.
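Isotonic regression itself is available in libraries such as scikit-learn (`IsotonicRegression`); for illustration, here is a minimal pool-adjacent-violators (PAV) sketch of the unweighted case, assuming the observations are already sorted by ascending uncertainty:

```python
def isotonic_fit(errors):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a
    sequence of localization errors sorted by ascending uncertainty.
    Adjacent blocks that violate monotonicity are merged into their
    weighted mean until the fitted sequence is non-decreasing."""
    vals, wts, cnts = [], [], []   # per-block value, weight, size
    for e in errors:
        vals.append(float(e)); wts.append(1.0); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v, w, c = vals.pop(), wts.pop(), cnts.pop()
            vals[-1] = (vals[-1] * wts[-1] + v * w) / (wts[-1] + w)
            wts[-1] += w; cnts[-1] += c
    fitted = []
    for v, c in zip(vals, cnts):
        fitted.extend([v] * c)
    return fitted

# A noisy dip at the third observation is averaged away.
fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

The fitted step function can then be evaluated at each bin's uncertainty thresholds to read off the estimated error bounds.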
4.4 Evaluation Metrics for Uncertainty Measures
In this subsection we construct methods to evaluate how well an uncertainty measure’s predicted bins represent the true error quantiles, and how accurate each bin’s estimated error bounds are.
4.4.1 Evaluating the Predicted Bins
A good uncertainty measure will have a strong correlation with localization error. Therefore, it should provide quantile thresholds that correspond to the true error quantiles. For example, since bin $B_1$ contains the predictions with uncertainties in the lowest quantile, the localization errors of the predictions in $B_1$ should form the lowest quantile of the test set. This generalises to each bin, up to $B_Q$, which should contain the errors in the highest quantile.
To evaluate this desired property, we propose to measure the similarity between each predicted bin and its respective theoretically perfect bin.
We create the ground truth (GT) bins by ordering the test set samples in ascending order of error. Then, we sequentially bin them into $Q$ equally sized bins: $B_1^{GT}, \ldots, B_Q^{GT}$.
For each predicted and GT bin pair $B_i$ & $B_i^{GT}$, we calculate the Jaccard Index (JI) between them and report the mean JI of each bin across all folds:

$JI(B_i, B_i^{GT}) = \frac{\lvert B_i \cap B_i^{GT} \rvert}{\lvert B_i \cup B_i^{GT} \rvert}.$ (9)
The higher the JI, the better the uncertainty measure has binned predictions by localization error, and hence the better it predicts localization error.
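Eq. (9), applied to sets of sample identifiers, is a one-line computation (a trivial sketch with names of our own):

```python
def jaccard_index(pred_bin, gt_bin):
    """Overlap between the sample ids a measure placed in bin i and the
    ids in the matching ground-truth error quantile: |A ∩ B| / |A ∪ B|."""
    a, b = set(pred_bin), set(gt_bin)
    return len(a & b) / len(a | b)

# Half of the predicted bin's members sit in the true error quantile.
ji = jaccard_index({"s1", "s2", "s3", "s4"}, {"s3", "s4", "s5", "s6"})
```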
4.4.2 Accuracy of Estimated Error Bounds
A good uncertainty measure will have a monotonically increasing relationship with localization error. Therefore, estimating the true function using isotonic regression should provide accurate error bound estimations.
To measure this, for each bin $B_i$, we calculate the percentage of predictions whose error falls inside the estimated error bound interval $[\hat{e}_{\text{low}}^{(i)}, \hat{e}_{\text{high}}^{(i)})$. The higher the percentage, the more accurate our estimated error bounds.
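This evaluation is a simple coverage percentage. A sketch with illustrative names, assuming a half-open interval to match the bin thresholds:

```python
import numpy as np

def bound_accuracy(errors, lower, upper):
    """Percentage of a bin's localization errors that fall inside the
    bin's estimated error-bound interval [lower, upper)."""
    errors = np.asarray(errors, dtype=float)
    inside = (errors >= lower) & (errors < upper)
    return 100.0 * inside.mean()

# Two of the four errors fall inside the estimated interval [2, 4).
acc = bound_accuracy([1.0, 2.0, 3.0, 4.0], lower=2.0, upper=4.0)
```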
5 Datasets
We perform our experiments using a dataset from the ASPIRE Registry [11], with Cardiac Magnetic Resonance Imaging (CMR) sequences containing a mix of subjects suffering from pulmonary arterial hypertension (PAH) and subjects with no pulmonary hypertension (PH). Each subject has a four-chamber (4CH) view and/or a short-axis (SA) view. Each CMR sequence has a fixed spatial resolution (each pixel represents 0.9375 mm) and 20 frames; we use only the first frame for landmark localization in this study. There are 303 SA images, each with three annotated landmarks: the inferior right ventricle insertion point (infSA), the superior right ventricle insertion point (supSA), and the inferior lateral reflection of the right ventricle free wall (RVSA). There are 422 4CH images, each with three annotated landmarks: the apex of the left ventricle at end diastole (LVDEV Apex), the mitral valve (mitral), and the tricuspid valve (tricuspid). The 4CH dataset represents a more challenging landmark localization task, as the images have much higher variability than the SA dataset. The landmarks were decided and manually labelled by a radiologist, as shown in Fig. 2. For this study, we consider the SA images the EASY dataset, and the 4CH images the HARD dataset.
6 Experiments and Results
First, in Sec. 6.2 we present the baseline landmark localization performance of PHDNet and UNet over both the SA and 4CH datasets using the SMHA, ECPV, and EMHA methods to extract coordinates. This gives us a comparison of the coordinate extraction performance of each of our methods, and a baseline to measure the effectiveness of each method’s uncertainty estimation. Second, in Sec. 6.3 we interrogate how using Quantile Binning with our uncertainty measures delineates predictions in terms of their localization error, and compare the predicted bins to the ground truth error quantiles. We show a practical example of how filtering out highly uncertain predictions can dramatically increase the proportion of acceptable localization predictions. Finally, in Sec. 6.4 we assess how well the uncertainty measures can predict error bounds for each bin. When comparing between bins, we use an unpaired t-test to test for significance. When comparing uncertainty metrics within the same bin category and model, we use a paired t-test.
6.1 Experimental Setup
Table 1. Localization error for the full set of predictions (All) and the lowest-uncertainty bin ($B_1$). Mean error and standard deviation are reported across all folds & all landmarks. Bold indicates best results in row for the given dataset.

Method       | 4 Chamber: UNet | 4 Chamber: PHDNet | Short Axis: UNet | Short Axis: PHDNet
SMHA (All)   | 10.00 ± 18.99   | 11.07 ± 21.33     | 5.86 ± 14.19     | 3.58 ± 3.52
SMHA ($B_1$) | 6.79 ± 6.09     | 5.80 ± 9.03       | 3.62 ± 2.45      | 2.78 ± 1.99
EMHA (All)   | 6.36 ± 8.01     | 9.14 ± 18.11      | 4.37 ± 8.86      | 3.36 ± 3.50
EMHA ($B_1$) | 4.93 ± 2.85     | 4.70 ± 3.21       | 2.98 ± 2.09      | 2.39 ± 1.90
ECPV (All)   | 8.13 ± 10.16    | 9.42 ± 13.07      | 4.97 ± 7.51      | 3.22 ± 2.93
ECPV ($B_1$) | 5.34 ± 3.00     | 5.10 ± 6.76       | 3.75 ± 2.13      | 2.47 ± 2.08

We split both datasets into 8 folds, and perform 8-fold cross validation for both UNet and PHDNet. For each iteration, we select one fold as our testing set, one as our holdout validation set, and the remaining 6 as our training set. We select $M = 5$ for the ensemble methods, training 5 separate models at each fold. We chose $M = 5$ as a compromise with computational constraints, asserting that 5 models are representative enough to compare the uncertainty methods for our purposes. Each model is identical apart from randomly initialised weights. We randomly select one of the models for our SMHA uncertainty measure. For our Quantile Binning method, we select $Q = 5$ bins, balancing the constraints of our small datasets with a useful number of uncertainty categories.
We implement our UNet model [21] using the MONAI package [3]. We design the model with 5 encoding-decoding levels, creating 1.63M learnable parameters. We modify the objective function from image segmentation to simultaneous landmark localization, minimising the mean squared error between the target and predicted heatmaps. We use the full image as input, and learn heatmaps of the same size. We generate target heatmaps using Eq. (1) with a standard deviation of 8, and train for 1000 epochs with a batch size of 2 and a learning rate of 0.001 using the Adam Optimiser (settings from [23]).
We implement our PHDNet model following [23], creating a model with 0.06M learnable parameters. For all experiments we trained PHDNet for 1000 epochs using a batch size of 32 and a learning rate of 0.001, using the Adam Optimiser. We train one landmark at a time. Note, the only differences in setup from [23] in this work are different fold splits and training for an additional 500 epochs (same as UNet) with no early stopping, since we now use our validation set for Quantile Binning.
6.2 Baseline Landmark Localization Performance
Figure 3 and the All rows in Table 1 show the baseline performance for UNet and PHDNet at localizing landmarks in our 4CH and SA datasets. We make the following observations:

When considering the entire set of landmarks (All), performance is better on the SA dataset for both models, with PHDNet outperforming UNet. On the 4CH dataset, UNet outperforms PHDNet, suggesting the higher capacity model of UNet is more robust to datasets with large variations.

Simply using a single model with our SMHA strategy is predictably less robust than ensemble approaches.

EMHA outperforms the previous gold standard of ECPV for coordinate extraction. However, does it outperform ECPV in terms of uncertainty estimation? We explore this in Section 6.3.
6.3 Analysis of the Predicted Quantile Bins
We apply quantile binning to each uncertainty measure: SMHA, EMHA and ECPV. We compare results over UNet and PHDNet for both the SA and 4CH datasets.
We found the most useful information is at the tail ends of the uncertainty distributions. Figs. 3(c) & 3(d) plot the Jaccard Index between ground truth error quantiles and predicted error quantiles. The bin of predictions in the highest uncertainty quantile ($B_5$) is significantly better at capturing the correct subset of predictions than the intermediate bins ($B_2$–$B_4$). Similarly, in most cases the bin representing the lowest uncertainties ($B_1$) had a significantly higher Jaccard Index than the intermediate bins, but still lower than $B_5$. Figs. 3(a) & 3(b) show the mean error of the samples of each quantile bin over both datasets. The most significant reduction in localization error is from $B_5$ to $B_4$ for all uncertainty measures, further indicating our uncertainty measures are well calibrated to filter out gross mispredictions. These findings suggest that most of the utility in the uncertainty measures investigated can be found at the tail ends of the scale. This is an intuitive finding, as the predictions in $B_5$ are certainly uncertain, and the predictions in $B_1$ are certainly certain.
The more poorly trained the landmark localization model, the more useful the uncertainty measure. Table 1 shows the localization error of all methods, models and datasets for the entire set (All) and the lowest-uncertainty subset of predictions. PHDNet’s baseline localization performance on the 4CH dataset was worse than UNet’s. However, when we consider the lowest-uncertainty subset of predictions, PHDNet sees a 47% average reduction in error from all predictions (All), compared to UNet’s average reduction of 30%. Similarly, UNet performed worse than PHDNet on the SA dataset, but saw an average error reduction of 31% compared to PHDNet’s 25%. This suggests that all investigated uncertainty measures are more effective at identifying gross mispredictions when models are poorly trained.
Using heatmap-based uncertainty measures is generalisable across heatmap generation approaches. The bin similarities in Figs. 3(c) & 3(d) show that using SMHA and EMHA yields similar performance with PHDNet and UNet, despite their different heatmap derivations. Surprisingly, using EMHA does not give a significant increase in bin similarity compared to SMHA, suggesting the thresholds remain relatively stable across models.
No investigated method is conclusively best for estimating uncertainty in all scenarios. For the more challenging 4CH data, Fig. 3(c) shows ECPV is significantly better than SMHA and EMHA for both models at capturing the true error quantiles, corroborating the findings of [4]. ECPV is particularly good at identifying the worst predictions. For the easier SA data, no method has a significantly higher Jaccard Index. Therefore, when we generalise across both models and datasets, all uncertainty measures fared broadly similarly on average in terms of error reduction between the entire set and the lowest-uncertainty subset of predictions: SMHA had an average error reduction of 35.07%, EMHA 32.94% and ECPV 32%.
Despite similar performances in uncertainty estimation, we found EMHA yields the greatest localization performance overall. Table 1 shows EMHA offers the best localization performance for the lowest-uncertainty subset across both datasets and models. This is due to the combination of offering the most robust coordinate extraction on average (Fig. 3) and similar uncertainty estimation performance (Figs. 3(c) & 3(d)). We more concretely demonstrate Quantile Binning’s ability to identify low-uncertainty predictions in Fig. 5. We clearly observe a significant increase in the percentage of images below the acceptable error threshold of 5mm when considering only the lowest-uncertainty predictions, with EMHA giving the greatest proportion of acceptable predictions. For both datasets we observe that PHDNet using EMHA has a higher proportion of acceptable predictions compared to UNet. Consulting Table 1, we also observe that PHDNet’s predictions in the lowest-uncertainty bin indeed have lower mean localization error than UNet’s corresponding bin for both datasets, with EMHA offering the lowest average localization error.
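The filtering effect shown in Fig. 5 amounts to a simple computation: the fraction of predictions whose error falls under the clinical threshold (5mm here), computed once over all predictions and once over the lowest-uncertainty subset. A small sketch with made-up error values:

```python
import numpy as np

def acceptable_fraction(errors_mm, threshold_mm=5.0):
    """Fraction of predictions whose localization error is under the threshold."""
    errors_mm = np.asarray(errors_mm, dtype=float)
    return float((errors_mm < threshold_mm).mean())

# Hypothetical per-image localization errors (mm).
all_errors = np.array([1.2, 3.4, 4.9, 6.1, 8.0, 2.2, 12.5, 4.4])
low_unc_errors = np.array([1.2, 2.2, 3.4])  # e.g. the lowest-uncertainty bin

print(acceptable_fraction(all_errors))      # fraction acceptable over all predictions
print(acceptable_fraction(low_unc_errors))  # fraction acceptable after filtering
```

In this toy case the filtered subset is entirely acceptable while the full set is not, mirroring the qualitative effect reported for the lowest-uncertainty bin.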
6.4 Analysis of Error Bound Estimation
We analyse how accurate the isotonically regressed error bound estimates are for our quantile bins. Figs. 3(e) & 3(f) show the percentage of samples in each bin that fall between the estimated error bounds.
We found we can predict the error bounds for the two extreme bins better than the intermediate bins. Figs. 3(e) & 3(f) show a similar pattern to the Jaccard Index in Figs. 3(c) & 3(d), with the two extreme bins predicting error bounds significantly more accurately than the inner bins. Again, this indicates the most useful uncertainty information is present at the extremes of the uncertainty distribution, with the predicted uncertainty-error function unable to capture a consistent relationship for the inner quantiles. This finding is intuitive, as it is easier to put a lower error bound on the most uncertain quantile of predictions, or an upper bound on the most certain predictions, than it is to assign tighter error bounds to middling uncertainty values.
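A minimal sketch of the bound-estimation step, substituting a plain pool-adjacent-violators (PAVA) routine for the paper’s isotonic regression: sort a calibration set by uncertainty, fit a monotonically non-decreasing error curve, and read each quantile bin’s error bounds off the fitted curve at its edges. The data and bin edges are illustrative.

```python
import numpy as np

def isotonic_fit(y):
    """Pool-Adjacent-Violators: least-squares monotonically non-decreasing fit."""
    means = [float(v) for v in y]
    counts = [1] * len(means)
    i = 0
    while i < len(means) - 1:
        if means[i] > means[i + 1]:  # violation: pool the two adjacent blocks
            total = means[i] * counts[i] + means[i + 1] * counts[i + 1]
            counts[i] += counts[i + 1]
            means[i] = total / counts[i]
            del means[i + 1], counts[i + 1]
            i = max(i - 1, 0)  # backtrack in case pooling created a new violation
        else:
            i += 1
    return np.repeat(means, counts)

# Hypothetical calibration split: error grows noisily with uncertainty.
rng = np.random.default_rng(2)
uncertainty = np.sort(rng.random(300) * 10)
errors = uncertainty + rng.normal(0, 1.5, 300)

fitted = isotonic_fit(errors)  # samples are already ordered by uncertainty
edges = np.linspace(0, 300, 6).astype(int)  # quintile bins over the sorted set
for b in range(5):
    lo, hi = fitted[edges[b]], fitted[min(edges[b + 1], 299)]
    print(f"bin {b}: estimated error bounds [{lo:.2f}, {hi:.2f}]")
```

Because the fitted curve is monotone, each bin’s lower and upper bounds are simply its edge values; how often held-out errors fall inside these intervals is what Figs. 3(e) & 3(f) measure.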
We also found that a well-defined upper bound for heatmap activations is important for error bound estimates. For both the 4CH and SA datasets, SMHA for PHDNet is significantly more accurate at predicting error bounds for the highest-uncertainty quantile than for the lowest-uncertainty quantile (56% & 72% compared to 30% & 27% for 4CH & SA, respectively), correlating with SMHA capturing a greater proportion of those bins (Jaccard Indexes of 32% & 24% compared to 16% & 15%). On the other hand, UNet using SMHA predicts error bounds for low-uncertainty bins better than high-uncertainty bins. This suggests that although PHDNet’s heatmap activation is a robust predictor of gross mispredictions, the less tight upper bound of its heatmap activations makes it hard to produce an accurate prediction for the lowest-uncertainty quantile. This is alleviated by using an ensemble of networks in EMHA, where the bound accuracy is improved to 62%.
EMHA and ECPV are more consistent than SMHA. Overall, there is no significant difference between the error bound estimation accuracy of EMHA and SMHA, but Figs. 3(e) & 3(f) show EMHA has less variation in performance between UNet and PHDNet compared to SMHA, suggesting an ensemble of models is more robust. For the 4CH dataset, PHDNet using ECPV is on average significantly more accurate at predicting error bounds than SMHA and EMHA. However, there are no significant differences for PHDNet on the easier SA dataset, nor for UNet on either dataset. There are also no significant differences between UNet and PHDNet in error bound estimation accuracy, with each method broadly equally effective for both models.
7 Discussion and Conclusion
7.1 Summary of Findings
This paper presented a general framework to assess any continuous uncertainty measure in landmark localization, demonstrating its use on three uncertainty metrics and two paradigms of landmark localization model. We introduced a new coordinate extraction and uncertainty estimation method, EMHA, offering the best baseline localization performance and competitive uncertainty estimation.
Our experiments indicate that both heatmap-based uncertainty metrics (SMHA, EMHA), as well as the gold standard coordinate variance uncertainty metric (ECPV), are applicable to both UNet and PHDNet. Despite the two models’ distinctly different approaches to generating heatmaps, using the maximum heatmap activation as an indicator of uncertainty is effective for both models. We showed that all investigated uncertainty metrics were effective at filtering out the gross mispredictions and identifying the 20% most certain predictions, but could not capture useful information for the intermediate uncertainty bins.
Our experiments also showed that EMHA and SMHA had a surprisingly similar ability to capture the true error quantiles of the best and worst 20% of predictions (Figs. 3(c) & 3(d)), but EMHA was more consistent with its performance predicting the error bounds of those bins across models (Figs. 3(e) & 3(f)). This suggests that the calibration of the head and tail ends of the heatmap distributions is stable across our ensemble of models, but susceptible to variance when fitting our isotonically regressed line to predict error bounds. On the more challenging 4CH dataset, ECPV broadly remained the gold standard for filtering out the worst predictions, but this trend did not continue in the easier SA dataset (Fig. 5).
In terms of error bound estimation, we found a strong correlation between a bin’s Jaccard Index to the true quantiles and the accuracy of its bounds. The two extreme bins could offer good error bound estimates, but the intermediate bins could not (Figs. 3(e) & 3(f)). We found all uncertainty methods performed broadly the same: effective at predicting error bounds for the extreme bins, but poor at predicting them for the intermediate bins. The one exception was PHDNet using SMHA, which could not accurately predict error bounds for the lowest-uncertainty bin due to the high variance in pixel activations of highly certain predictions.
Overall, we found that EMHA provided the most robust coordinate extraction of all methods, showing the best baseline localization error. When considering only the predictions with the lowest uncertainty, using EMHA achieves the lowest mean localization error across all models and datasets.
7.2 Recommendations
In terms of utility, when resources are available to train an ensemble of models, we recommend using EMHA as the coordinate extraction and uncertainty estimation method. EMHA offers the best baseline localization performance with a sufficient ability to filter out the gross mispredictions and identify the most certain predictions. It can also sufficiently estimate error bounds for these two bins. However, between these extremes the uncertainty metric is not calibrated well enough to bin predictions robustly or to predict error bounds accurately. If resources are constrained, SMHA is surprisingly effective at capturing the true error quantiles for the extreme bins, but note that when using a patch-based voting heatmap that is not strictly bounded, the error bound estimation for the lowest-uncertainty bin is not robust.
7.3 Conclusion
Beyond the above recommendations, we hope the framework described in this paper can be used to assess refined or novel uncertainty metrics for landmark localization, and act as a baseline for future work. Furthermore, we have shown that both the voting-derived heatmap of PHDNet and the regressed Gaussian heatmap of UNet can be exploited for uncertainty estimation. In this paper we only explored the activation of the peak pixel, but it is likely that more informative measures can be extracted from the broader structure of the heatmap, promising greater potential for uncertainty estimation in landmark localization waiting to be uncovered.
References
 [1] (2005) Robust active appearance models and their application to medical image analysis. IEEE Trans. Med. Imag. 24 (9), pp. 1151–1169. Cited by: §1.

 [2] (2018) Super-FAN: integrated facial landmark localization and super-resolution of real-world low-resolution faces in arbitrary poses with GANs. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pp. 109–117. Cited by: §2.1.
 [3] Project MONAI Note: Available at https://doi.org/10.5281/zenodo.4323059 External Links: Document, Link Cited by: §6.1.
 [4] (2020) Evaluating deep learning uncertainty measures in cephalometric landmark localization. In BIOIMAGING, pp. 213–220. Cited by: §1, §2.2, §4.1.3, §4.2.3, §6.3.
 [5] (2015) Automatic localization of the left ventricle in cardiac MRI images using deep learning. In Proc. EMBC, pp. 683–686. Cited by: §2.1.
 [6] (2018) Fast multiple landmark localisation using a patch-based iterative network. In Proc. MICCAI, pp. 563–571. Cited by: §2.1.
 [7] (2011) Semi-automatic construction of reference standards for evaluation of image registration. Medical Image Analysis 15 (1), pp. 71–84. Cited by: §1.

 [8] (2019) XAI-Explainable artificial intelligence. Science Robotics 4 (37) (eng). External Links: ISSN 2470-9476 Cited by: §1.
 [9] (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §4.3.1, §4.3.2.
 [10] (2016) Interactive machine learning for health informatics: When do we need the human-in-the-loop?. Brain Informatics 3 (2), pp. 119–131. Cited by: §1.
 [11] (2012) ASPIRE registry: assessing the spectrum of pulmonary hypertension identified at a REferral centre. European Respiratory Journal 39 (4), pp. 945–955. Cited by: §5.
 [12] (2002) Consistent landmark and intensitybased image registration. IEEE Trans. Med. Imag. 21 (5), pp. 450–461. Cited by: §1.
 [13] (2020) Automated cephalometric landmark detection with confidence regions using Bayesian convolutional neural networks. BMC Oral Health 20 (1), pp. 1–10. Cited by: §2.2.
 [14] (2020) Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Trans. Med. Imag. 39 (12), pp. 3868–3878. Cited by: §2.2.
 [15] (2015) Obtaining well calibrated probabilities using Bayesian binning. Twenty-Ninth AAAI Conference on Artificial Intelligence 2015, pp. 2901–2907. Cited by: §4.3.1.
 [16] (2020) Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical Image Analysis 59, pp. 101557. Cited by: §2.2.
 [17] (2020) Deep learningbased regression and classification for automatic landmark localization in medical images. IEEE Trans. Med. Imag. 39 (12), pp. 4011–4022. External Links: Document Cited by: §2.1.
 [18] (2018) CNN-based landmark detection in cardiac CTA scans. In MIDL Amsterdam, pp. 1–11. Cited by: §2.1, §2.1, §4.1.2.
 [19] (2019) Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Medical Image Analysis 54, pp. 207–219. Cited by: §2.1.
 [20] (2020) Uncertainty estimation in landmark localization based on gaussian heatmaps. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis, pp. 42–51. Cited by: §2.2.
 [21] (2015) U-Net: convolutional networks for biomedical image segmentation. In Proc. MICCAI, pp. 234–241. Cited by: §1, §2.1, §4.1, §6.1.
 [22] (2017) An overview of multitask learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §2.1.
 [23] (2021) Confidence-quantifying landmark localisation for cardiac MRI. In IEEE Int. Symp. Biomed. Imag., pp. 985–988. Cited by: §1, §2.1, §2.1, §2.2, §3, §4.1.2, §4.1, §6.1, §6.1.
 [24] (2018) Quantized densely connected U-Nets for efficient landmark localization. In Proc. ECCV, pp. 339–354. Cited by: §2.1.
 [25] (2018) Deep geodesic learning for segmentation and anatomical landmarking. IEEE Trans. Med. Imag. 38 (4), pp. 919–931. Cited by: §2.1.
 [26] (2020) Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791. Cited by: §2.2.
 [27] (2017) Stacked hourglass network for robust facial landmark localisation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pp. 79–87. Cited by: §2.1.

 [28] (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. Cited by: §4.3.2.
 [29] (2017) Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks. IEEE Trans. Image Process. 26 (10), pp. 4753–4764. Cited by: §2.1, §2.1.
 [30] (2019) An attention-guided deep regression model for landmark detection in cephalograms. In Proc. MICCAI, pp. 540–548. Cited by: §2.1.