
Common Limitations of Image Processing Metrics: A Picture Story

by Annika Reinke, et al.

While the importance of automatic image analysis is increasing at an enormous pace, recent meta-research revealed major flaws with respect to algorithm validation. Specifically, performance metrics are key for objective, transparent and comparative performance assessment, but relatively little attention has been given to the practical pitfalls of using specific metrics for a given image analysis task. A common mission of several international initiatives is therefore to provide researchers with guidelines and tools to choose performance metrics in a problem-aware manner. This dynamically updated document aims to illustrate important limitations of performance metrics commonly applied in the field of image analysis. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts.





1 Purpose

Metrics are key to assessing the performance of image analysis algorithms in an objective and meaningful manner. So far, however, relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. An international survey (maier2018rankings), for example, revealed the choice of inappropriate metrics as one of the core problems related to performance assessment in medical image analysis. Similar problems are present in other fields of imaging research (honauer2015hci; correia2006video).

Under the umbrella of the Helmholtz Imaging Platform (HIP), three international initiatives have now joined forces to address these issues: the Biomedical Image Analysis Challenges (BIAS) initiative, the challenge working group of the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society, and the benchmarking working group of the MONAI framework. A core mission is to provide researchers with guidelines and tools to choose performance metrics in a problem-aware manner. This dynamically updated document aims to illustrate important pitfalls and drawbacks of metrics commonly applied in the field of image analysis. The current version is based on a Delphi process (brown1968delphi) on metrics conducted with an international consortium of medical image analysis experts.

2 Segmentation metrics

Image segmentation is one of the most popular image processing tasks. In fact, an international meta-analysis revealed segmentation to be the most frequent medical image processing task in international competitions (challenges) (maier2018rankings). The metrics chosen in segmentation challenges radically influence the resulting rankings (maier2018rankings; reinke2018exploit), and although several papers highlight specific strengths and weaknesses of common metrics (kofler2021DICE; gooding2018comparative; vaassen2020evaluation; konukoglu2012discriminative; margolin2014evaluate), researchers lack guidelines for choosing the right metric for a given problem (maier2018rankings). To address this community request, this document summarizes common pitfalls related to the most frequently used metrics in medical image segmentation, namely the Dice Similarity Coefficient (DSC) (dice1945measures), the Hausdorff Distance (HD) (huttenlocher1993comparing) and the Intersection over Union (IoU) (jaccard1912distribution) (see Figure 1). To this end, the problems related to segmentation metrics are assigned to four categories: (1) awareness of fundamental mathematical properties of metrics, necessary to determine the applicability of a metric, (2) suitability for the underlying image processing task, (3) metric aggregation to combine metric values of single images into one accumulated score, and (4) metric combination to reflect different aspects of algorithm validation.


Figure 1: Most commonly used overlap-based (a/b) and contour-based (c) segmentation metrics: (a) the Dice Similarity Coefficient (DSC), (b) the Intersection over Union (IoU) and (c) the Hausdorff Distance (HD), with |A| denoting the cardinality of set A, A ∩ B the intersection between sets A and B, A ∪ B the union between sets A and B, and d(a, b) the distance between points a and b.
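The three metrics in Figure 1 can be computed directly from binary masks. The following sketch (using NumPy and SciPy's `directed_hausdorff`; the helper names are our own) illustrates the definitions:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dsc(a, b):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def iou(a, b):
    """Intersection over Union (Jaccard index): |A ∩ B| / |A ∪ B|."""
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def hd(a, b):
    """Hausdorff Distance: maximum of the two directed distances
    between the foreground point sets of the two masks."""
    pts_a, pts_b = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])
```

Note that the HD operates on point coordinates rather than on the overlap, which is why it behaves very differently from the DSC and IoU in the examples below.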

2.1 Fundamental mathematical properties

Awareness of the mathematical properties of a metric is crucial when determining its suitability for a given application. In this section, we focus on the DSC and the HD, but the properties also apply to other related metrics, such as the IoU (also called Jaccard Index (jaccard1912distribution)).

As illustrated in Figure 1a, the DSC was designed to measure the overlap between two given objects and yields a value between 0 (no overlap) and 1 (full overlap). The metric is straightforward to compute and interpret, but comes with several pitfalls highlighted in the following paragraphs:

Small structures

Segmentation of small structures, such as brain lesions, cells imaged at low magnification or distant cars, is essential for many image processing applications. In these cases, the DSC may not be an appropriate metric, as illustrated in Figure 2: a single-pixel difference between two predictions can have a large impact on the metric value. Given that the correct outlines (e.g. of pathologies) are often unknown, and taking into account the potentially high inter-observer variability in generating reference annotations (joskowicz2019inter), it is typically not desirable for a few pixels to influence the metric value so strongly.


Figure 2: Effect of structure size on the DSC. The predictions of two algorithms (Prediction 1/2) differ in only a single pixel. In case of the small structure (bottom row), this has a substantial effect on the corresponding metric value.
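The size effect in Figure 2 is easy to reproduce numerically. In this hypothetical sketch, the same single-pixel error barely affects the DSC of a large structure but markedly degrades it for a small one:

```python
import numpy as np

def dsc(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Hypothetical 20x20 scene: the prediction misses exactly one pixel in each case.
large_ref = np.zeros((20, 20), dtype=bool)
large_ref[5:15, 5:15] = True                 # 100-pixel structure
large_pred = large_ref.copy()
large_pred[5, 5] = False                     # one pixel missed

small_ref = np.zeros((20, 20), dtype=bool)
small_ref[5:7, 5:7] = True                   # 4-pixel structure
small_pred = small_ref.copy()
small_pred[5, 5] = False                     # one pixel missed

print(round(dsc(large_ref, large_pred), 3))  # 0.995 -- barely affected
print(round(dsc(small_ref, small_pred), 3))  # 0.857 -- substantially lower
```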

Noise/errors in the reference annotations

Similar problems may arise in the presence of annotation artifacts. Figure 3 demonstrates that a single erroneous pixel in the reference annotation may lead to a substantial decrease in the measured performance, especially in the case of the HD.


Figure 3: Effect of annotation errors/noise. A single erroneously annotated pixel may lead to a large decrease in performance, especially in the case of the HD or in the case of the DSC when applied to small structures.
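The sensitivity of the HD to a single stray annotation can be sketched as follows (a hypothetical example; the structure position and the location of the erroneous pixel are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hd(a, b):
    """Hausdorff Distance between the foreground point sets of two masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pa, pb)[0],
               directed_hausdorff(pb, pa)[0])

# Hypothetical 50x50 scene with a 10x10 target structure.
pred = np.zeros((50, 50), dtype=bool)
pred[10:20, 10:20] = True

ref_clean = pred.copy()              # prediction matches the reference exactly
ref_noisy = pred.copy()
ref_noisy[45, 45] = True             # one erroneously annotated pixel

print(hd(ref_clean, pred))           # 0.0
print(hd(ref_noisy, pred))           # ~36.77, dominated entirely by the stray pixel
```

Because the HD is a maximum over distances, a single outlier determines the entire metric value, whereas an overlap-based metric would change only marginally here.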

Shape unawareness

Metrics measuring the overlap between objects are not designed to uncover differences in shapes. This is an important problem for many applications, such as radiotherapy. Figure 4 illustrates that completely different object shapes may lead to the exact same DSC value.


Figure 4: Effect of different shapes. The shapes of the predictions of five algorithms (Prediction 1-5) differ substantially, but lead to the exact same DSC.
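Shape unawareness can be demonstrated with two deliberately constructed predictions that have the same size and the same overlap with the reference: a shifted square and an elongated rectangle yield exactly the same DSC despite their different shapes (a hypothetical sketch):

```python
import numpy as np

def dsc(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

ref = np.zeros((20, 20), dtype=bool)
ref[2:8, 2:8] = True                 # 6x6 reference square (36 px)

pred_square = np.zeros((20, 20), dtype=bool)
pred_square[2:8, 4:10] = True        # shifted 6x6 square: 36 px, 24 px overlap

pred_rect = np.zeros((20, 20), dtype=bool)
pred_rect[2:6, 2:11] = True          # 4x9 rectangle: 36 px, 24 px overlap

print(dsc(ref, pred_square), dsc(ref, pred_rect))  # identical DSC (2/3) for both
```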

Oversegmentation vs. undersegmentation

In some applications, such as autonomous driving or radiotherapy, it may be highly relevant whether an algorithm tends to over- or under-segment the target structure. The DSC, however, does not penalize over- and under-segmentation equally (yeghiazaryan2018family). As depicted in Figure 5, a difference of a single pixel in the outline yields different DSC scores, with oversegmentation scoring slightly higher. Distance-based metrics such as the HD, in contrast, treat both cases equally.


Figure 5: Effect of undersegmentation vs. oversegmentation. The outlines of the predictions of two algorithms (Prediction 1/2) differ in only a single pixel (Prediction 1: undersegmentation, Prediction 2: oversegmentation). This has no effect on the HD, but yields a substantially different DSC score.
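The asymmetry can be checked numerically: removing one boundary pixel (undersegmentation) lowers the DSC slightly more than adding one (oversegmentation), while the HD is identical in both cases. A hypothetical sketch:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dsc(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hd(a, b):
    """Hausdorff Distance between the foreground point sets of two masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pa, pb)[0],
               directed_hausdorff(pb, pa)[0])

ref = np.zeros((10, 10), dtype=bool)
ref[3:7, 3:7] = True                      # 16-pixel reference

under = ref.copy(); under[3, 3] = False   # one boundary pixel removed
over = ref.copy();  over[3, 2] = True     # one boundary pixel added

print(dsc(ref, under), dsc(ref, over))    # ~0.968 vs. ~0.970: oversegmentation wins
print(hd(ref, under), hd(ref, over))      # 1.0 vs. 1.0: the HD cannot tell them apart
```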

2.2 Suitability for underlying image processing task

Performance metrics are typically expected to reflect a domain-specific validation goal (e.g. a clinical goal). Previous research, however, suggests that this is often not the case. A common problem is that segmentation metrics, such as the DSC, are applied to detection and localization tasks (jager2020challenges), as illustrated in Figure 6. From a clinical perspective, for example, the algorithm producing Prediction 2 and covering all three structures of interest (e.g. tumors) would be much more valuable than the one producing a highly accurate segmentation of one structure but missing the other two (Prediction 1). This is not reflected in the metric values, which are substantially higher for Prediction 1. In general, the DSC strongly rewards accurate segmentation of individual objects and is therefore not an appropriate metric for the detection of multiple structures (yeghiazaryan2018family).


Figure 6: Effect of using a segmentation metric for object detection. In this example, the prediction of one algorithm only detecting one of three structures (Prediction 1) leads to a higher DSC compared to that of a second algorithm (Prediction 2) detecting all structures.
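The scenario of Figure 6 can be imitated with synthetic masks (the layout and structure sizes below are hypothetical): a prediction that perfectly segments one of three structures outscores a prediction that roughly detects all three.

```python
import numpy as np

def dsc(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def add_block(mask, r, c, size=3):
    """Paint a size x size square structure at (r, c)."""
    mask[r:r + size, c:c + size] = True

# Reference: three separate 3x3 structures (e.g. lesions)
ref = np.zeros((20, 20), dtype=bool)
for r, c in [(2, 2), (2, 12), (12, 2)]:
    add_block(ref, r, c)

# Prediction 1: one structure segmented perfectly, two missed entirely
pred1 = np.zeros((20, 20), dtype=bool)
add_block(pred1, 2, 2)

# Prediction 2: all three structures found, each shifted by two pixels
pred2 = np.zeros((20, 20), dtype=bool)
for r, c in [(2, 4), (2, 14), (12, 4)]:
    add_block(pred2, r, c)

print(dsc(ref, pred1))  # 0.5   -- misses two structures, yet scores higher
print(dsc(ref, pred2))  # ~0.33 -- detects every structure, yet scores lower
```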

2.3 Metric aggregation

In international competitions (challenges), metric values are often aggregated over all test cases to produce a challenge ranking (maier2018rankings). Figures 7 and 8 illustrate why this may be problematic in the presence of missing values.


Figure 7: Effect of missing values when aggregating metric values. In this example, ignoring missing values leads to a substantially higher DSC compared to setting missing values to the worst possible value (here: 0).

In the case of metrics with fixed boundaries, like the DSC or the IoU, missing values can easily be set to the worst possible value (here: 0). For distance-based metrics without a lower/upper bound, the strategy for dealing with missing values is less obvious. In the case of the HD, one may, for example, choose the maximum possible distance within the image and add 1, or normalize the metric values to [0, 1] and use the worst possible value (here: 1). Crucially, however, every choice will produce a different aggregated value (Figure 8), thus potentially affecting the ranking.


Figure 8: Effect of missing values when aggregating metric values for metrics without fixed boundaries (here: HD). In this example, ignoring or treating missing values in different ways leads to substantially different HD values.
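The effect of different missing-value strategies on an aggregated HD can be sketched as follows (the per-case values and the image size are hypothetical):

```python
import numpy as np

# Hypothetical per-case HD values; None marks a case with a missing prediction.
hd_values = [3.0, 5.0, None, 4.0]
diag = np.hypot(100, 100)       # maximum possible distance in a 100x100 image

observed = [v for v in hd_values if v is not None]

# Strategy 1: ignore missing cases entirely
mean_ignore = np.mean(observed)
# Strategy 2: penalize missing cases with the maximum image distance plus 1
mean_penalize = np.mean([v if v is not None else diag + 1 for v in hd_values])
# Strategy 3: normalize to [0, 1] and set missing cases to the worst value (1)
mean_normalized = np.mean([v / diag if v is not None else 1.0 for v in hd_values])

print(mean_ignore, mean_penalize, mean_normalized)  # three different aggregates
```

Since each strategy yields a different aggregate, two challenges using the same metric but different missing-value handling can produce different rankings for identical submissions.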

2.4 Metric combination

A single metric typically does not reflect all aspects that are essential for algorithm validation. Hence, multiple metrics with different properties are often combined. However, the selection of metrics should be well considered, as some metrics are mathematically related to each other (taha2014formal; taha2015metrics). A prominent example is the IoU (the most popular segmentation metric in computer vision), which highly correlates with the DSC (the most popular segmentation metric in medical image analysis). In fact, the IoU and the DSC are mathematically related (taha2015metrics):

DSC = 2 · IoU / (1 + IoU)
Combining metrics that are related will not provide additional information for a ranking. Figure 9 illustrates how the ranking can change when adding a metric that measures different properties.


Figure 9: Effect of combining different metrics for a ranking. Mutually dependent metrics (DSC and IoU) will lead to the same ranking, whereas metrics measuring different properties (HD) will lead to a different ranking.
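The DSC–IoU relation can be verified numerically; because DSC = 2 · IoU / (1 + IoU) is a strictly increasing mapping, rankings based on either metric always agree. A minimal sketch with random masks:

```python
import numpy as np

def dsc(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def iou(a, b):
    """Intersection over Union for two binary masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

# Random binary masks (hypothetical data)
rng = np.random.default_rng(0)
a = rng.random((30, 30)) > 0.5
b = rng.random((30, 30)) > 0.5

d, j = dsc(a, b), iou(a, b)
print(abs(d - 2 * j / (1 + j)))   # ~0: the two metrics carry the same information
```

Adding the IoU to a ranking that already uses the DSC therefore changes nothing, whereas adding a contour-based metric such as the HD can reorder the ranking.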

3 Conclusion

Choosing the right metric for a specific image processing task is non-trivial. With this (dynamic) paper, we wish to raise awareness of some common flaws of the most frequently used metrics in the field of image processing, encouraging researchers to reconsider common workflows.


This work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Platform (HIP). It was further supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center as well as by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871 and NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH.