Metrics are key to assessing the performance of image analysis algorithms in an objective and meaningful manner. So far, however, relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. An international survey (maier2018rankings), for example, revealed the choice of inappropriate metrics as one of the core problems related to performance assessment in medical image analysis. Similar problems are present in other fields of imaging research (honauer2015hci; correia2006video).
Under the umbrella of the Helmholtz Imaging Platform (HIP, https://www.helmholtz-imaging.de/), three international initiatives have now joined forces to address these issues: the Biomedical Image Analysis Challenges (BIAS) initiative (https://www.dkfz.de/en/cami/research/topics/biasInitiative.html?m=1611915160&), the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society’s challenge working group, and the benchmarking working group of the MONAI framework (https://monai.io/). A core mission is to provide researchers with guidelines and tools for choosing performance metrics in a problem-aware manner. This dynamically updated document aims to illustrate important pitfalls and drawbacks of metrics commonly applied in the field of image analysis. The current version is based on a Delphi process (brown1968delphi) on metrics conducted with an international consortium of medical image analysis experts.
2 Segmentation metrics
Image segmentation is one of the most popular image processing tasks. In fact, an international meta-analysis revealed segmentation as the most frequent medical image processing task in international competitions (challenges) (maier2018rankings). The chosen metrics in segmentation challenges radically influence the resulting rankings (maier2018rankings; reinke2018exploit), and although several papers highlight specific strengths and weaknesses of common metrics (kofler2021DICE; gooding2018comparative; vaassen2020evaluation; konukoglu2012discriminative; margolin2014evaluate), researchers are missing guidelines for choosing the right metric for a given problem (maier2018rankings). To address this community request, this document summarizes common pitfalls related to the most frequently used metrics in medical image segmentation, namely the Dice Similarity Coefficient (DSC) (dice1945measures), the Hausdorff Distance (HD) (huttenlocher1993comparing), and the Intersection over Union (IoU) (jaccard1912distribution) (see Figure 1). To this end, the problems related to segmentation metrics are assigned to four categories, namely (1) awareness of fundamental mathematical properties of metrics, necessary to determine the applicability of a metric, (2) suitability for the underlying image processing task, (3) metric aggregation to combine metric values of single images into one accumulated score and (4) metric combination to reflect different aspects in algorithm validation.
2.1 Fundamental mathematical properties
Awareness of the mathematical properties of a metric is crucial when determining its suitability for a given application. In this section, we focus on the DSC and the HD, but the properties also apply to other related metrics, such as the IoU (also called Jaccard Index (jaccard1912distribution)).
As illustrated in Figure 1a, the DSC was designed to measure the overlap between two given objects and yields a value between 0 (no overlap) and 1 (full overlap). The metric is straightforward to compute and interpret, but comes with several pitfalls highlighted in the following paragraphs:
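As a concrete point of reference for the pitfalls discussed below, the following sketch computes the DSC (and the closely related IoU) on binary masks represented as sets of pixel coordinates. The function names and the convention for two empty masks are illustrative choices, not part of any specific library.

```python
def dsc(pred, ref):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    if not pred and not ref:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def iou(pred, ref):
    """Intersection over Union (Jaccard index): |A ∩ B| / |A ∪ B|."""
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

# Masks as sets of (row, col) pixel coordinates
ref  = {(0, 0), (0, 1), (1, 0), (1, 1)}   # 2x2 reference object
pred = {(0, 0), (0, 1), (1, 0)}           # prediction misses one pixel

print(dsc(pred, ref))  # 2*3 / (3+4) ≈ 0.857
print(iou(pred, ref))  # 3 / 4 = 0.75
```

Note that both metrics are set-based: they count pixel memberships only, which is the root of several of the pitfalls below.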
Small structures

Segmentation of small structures, such as brain lesions, cells imaged at low magnification or distant cars, is essential for many image processing applications. In these cases, the DSC may not be an appropriate metric, as illustrated in Figure 2. In fact, a single-pixel difference between two predictions can have a large impact on the difference between their metric values. Given that the correct outlines (e.g. of pathologies) are often unknown, and taking into account the potentially high inter-observer variability in generating reference annotations (joskowicz2019inter), it is typically not desirable for a few pixels to influence the metric so strongly.
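The size dependence can be made explicit with a toy computation: the same single-pixel error costs far more DSC on a small structure than on a large one. The object sizes below are arbitrary illustrative choices.

```python
def dsc(pred, ref):
    """Dice Similarity Coefficient on pixel-coordinate sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def square(size):
    """Axis-aligned size x size square of (row, col) pixels at the origin."""
    return {(r, c) for r in range(size) for c in range(size)}

for size in (2, 100):
    ref = square(size)
    pred = ref - {(0, 0)}  # identical prediction except one missing pixel
    print(size, round(dsc(pred, ref), 5))
# The 1-pixel error drops the DSC of the 2x2 object to ~0.857,
# while the 100x100 object is barely affected (~0.99995).
```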
Noise/errors in the reference annotations
Similar problems may arise in the presence of annotation artifacts. Figure 3 demonstrates that a single erroneous pixel in the reference annotation may lead to a substantial decrease in the measured performance, especially in the case of the HD.
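The HD's sensitivity to annotation noise follows directly from its definition as a maximum over minimal distances: one stray pixel in the reference dominates the metric. A minimal pure-Python sketch (the stray-pixel position is made up for illustration):

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two pixel sets (Euclidean)."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

ref   = {(r, c) for r in range(4) for c in range(4)}  # clean 4x4 reference
noisy = ref | {(30, 30)}                              # one erroneous annotated pixel

pred = ref  # a prediction that is perfect w.r.t. the clean reference
print(hausdorff(pred, ref))    # 0.0
print(hausdorff(pred, noisy))  # jumps to the distance to the stray pixel, ~38.2
```

Because the HD takes a maximum, a single outlier determines the whole score; percentile variants such as the 95th-percentile HD are commonly used to mitigate exactly this behavior.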
Shape unawareness

Metrics measuring the overlap between objects are not designed to uncover differences in shape. This is an important problem for many applications, such as radiotherapy. Figure 4 illustrates that completely different object shapes may lead to the exact same DSC value.
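This can be verified with a small constructed example (the shapes are illustrative): a compact block and a thin contour of equal size, both inside the reference, receive the identical DSC.

```python
def dsc(pred, ref):
    """Dice Similarity Coefficient on pixel-coordinate sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))

ref = {(r, c) for r in range(4) for c in range(4)}    # 4x4 reference, 16 px

blob = {(r, c) for r in range(3) for c in range(3)}   # compact 3x3 block, 9 px
line = {(0, c) for c in range(4)} \
     | {(r, 0) for r in range(1, 4)} \
     | {(3, c) for c in range(1, 3)}                  # thin U-shaped contour, 9 px

print(len(blob), len(line))             # both 9 pixels, both inside ref
print(dsc(blob, ref), dsc(line, ref))   # identical DSC despite different shapes
```

Since only set cardinalities enter the formula, any two predictions with equal size and equal intersection are indistinguishable to the DSC, no matter how different their geometry is.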
Oversegmentation vs. undersegmentation
In some applications, such as autonomous driving or radiotherapy, it may be highly relevant whether an algorithm tends to over- or undersegment the target structure. The DSC, however, does not represent over- and undersegmentation equally (yeghiazaryan2018family). As depicted in Figure 5, a difference of a single pixel in the outline yields different DSC scores, with oversegmentation being rewarded. Distance-based metrics such as the HD, in contrast, are invariant to this asymmetry.
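The asymmetry and the HD's invariance to it can be checked directly on a toy object (layout chosen for illustration): adding one boundary pixel scores slightly better than removing one, while the HD is 1 in both cases.

```python
import math

def dsc(pred, ref):
    """Dice Similarity Coefficient on pixel-coordinate sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two pixel sets (Euclidean)."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

ref   = {(r, c) for r in range(3) for c in range(3)}  # 3x3 reference
over  = ref | {(0, 3)}   # one extra boundary pixel (oversegmentation)
under = ref - {(0, 0)}   # one missing boundary pixel (undersegmentation)

print(dsc(over, ref), dsc(under, ref))              # 18/19 ≈ 0.947 vs 16/17 ≈ 0.941
print(hausdorff(over, ref), hausdorff(under, ref))  # both 1.0
```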
2.2 Suitability for underlying image processing task
Performance metrics are typically expected to reflect a domain-specific validation goal (e.g. a clinical goal). Previous research, however, suggests that this is often not the case. A common problem is that segmentation metrics, such as the DSC, are applied to detection and localization tasks (jager2020challenges), as illustrated in Figure 6. From a clinical perspective, for example, the algorithm producing Prediction 2, which covers all three structures of interest (e.g. tumors), would be much more valuable than the one producing Prediction 1, a highly accurate segmentation of one structure that misses the other two. This is not reflected in the metric values, which are substantially higher for Prediction 1. In general, the DSC is strongly biased against single objects and is therefore not appropriate for a detection task involving multiple structures (yeghiazaryan2018family).
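A hypothetical toy scenario (the lesion layout and prediction pixels are made up, not taken from Figure 6) shows the mismatch: the DSC ranks a single accurate segmentation above rough coverage of all lesions, although only the latter detects every lesion.

```python
def dsc(pred, ref):
    """Dice Similarity Coefficient on pixel-coordinate sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def square(r0, c0, size=2):
    """size x size block of (row, col) pixels with top-left corner (r0, c0)."""
    return {(r0 + r, c0 + c) for r in range(size) for c in range(size)}

lesions = [square(0, 0), square(0, 10), square(10, 0)]  # three 2x2 lesions
ref = set().union(*lesions)

pred1 = square(0, 0)                     # segments one lesion perfectly, misses two
pred2 = {(0, 0), (0, 10), (10, 0),       # hits one true pixel per lesion ...
         (5, 5), (5, 6), (6, 5),         # ... plus a cluster of
         (6, 6), (7, 5), (7, 6)}         # false-positive pixels

print(dsc(pred1, ref))                   # 0.5  - despite missing two lesions
print(dsc(pred2, ref))                   # ≈ 0.29 - lower, yet detects every lesion
print([bool(pred2 & l) for l in lesions])  # [True, True, True]
```

A detection-oriented validation would instead count per-lesion hits and misses (e.g. lesion-wise sensitivity and false positives), which reverses the verdict here.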
2.3 Metric aggregation
In international competitions (challenges), metric values are often aggregated over all test cases to produce a challenge ranking (maier2018rankings). Figures 7 and 8 illustrate why this may be problematic in the presence of missing values.
In the case of metrics with fixed bounds, like the DSC or the IoU, missing values can simply be set to the worst possible value (here: 0). For distance-based measures without lower/upper bounds, the strategy for dealing with missing values is not trivial. In the case of the HD, one may, for instance, take the maximum possible distance within the image and add 1, or normalize the metric values to [0, 1] and use the worst possible value (here: 1). Crucially, however, every choice will produce a different aggregated value (Figure 8), thus potentially affecting the ranking.
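A short sketch makes the dependence on this choice concrete. The HD values, the image size and the two penalty strategies below are illustrative assumptions, mirroring the two options named above.

```python
import math

hd_values = [3.0, 5.0, None]        # None: the algorithm produced no output for this case
img_diag = math.hypot(256, 256)     # maximum possible distance in a 256x256 image

# Strategy 1: replace a missing value by the image diagonal + 1.
filled1 = [v if v is not None else img_diag + 1 for v in hd_values]

# Strategy 2: normalize HDs to [0, 1] by the diagonal; missing -> worst value 1.
filled2 = [v / img_diag if v is not None else 1.0 for v in hd_values]

mean1 = sum(filled1) / len(filled1)
mean2 = sum(filled2) / len(filled2)
print(mean1, mean2)  # two different aggregated scores for the same raw results
```

Since different algorithms typically have different numbers of missing cases, such differences in the aggregated value can reorder a challenge ranking.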
2.4 Metric combination
A single metric typically does not reflect all important aspects that are essential for algorithm validation. Hence, multiple metrics with different properties are often combined. However, the selection of metrics should be well considered, as some metrics are mathematically related to each other (taha2014formal; taha2015metrics). A prominent example is the IoU – the most popular segmentation metric in computer vision – which highly correlates with the DSC – the most popular segmentation metric in medical image analysis. In fact, the IoU and the DSC are mathematically related (taha2015metrics): IoU = DSC / (2 − DSC), or equivalently DSC = 2 · IoU / (1 + IoU).
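The relation can be verified numerically; this sketch uses the standard identity DSC = 2 · IoU / (1 + IoU), which holds for any pair of masks.

```python
def dsc(pred, ref):
    """Dice Similarity Coefficient on pixel-coordinate sets."""
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def iou(pred, ref):
    """Intersection over Union (Jaccard index) on pixel-coordinate sets."""
    return len(pred & ref) / len(pred | ref)

ref  = {(r, c) for r in range(4) for c in range(4)}
pred = {(r, c) for r in range(1, 5) for c in range(4)}  # shifted down by one row

d, j = dsc(pred, ref), iou(pred, ref)
print(d, 2 * j / (1 + j))  # identical: one metric fully determines the other
```

Because the mapping between the two is strictly monotone, IoU and DSC always produce the same ranking of algorithms; reporting both adds no discriminative information.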
Combining metrics that are related will not provide additional information for a ranking. Figure 9 illustrates how the ranking can change when adding a metric that measures different properties.
Choosing the right metric for a specific image processing task is a non-trivial task. With this (dynamic) paper, we wish to raise awareness about some of the common flaws of the most frequently used metrics in the field of image processing, encouraging researchers to reconsider common workflows.
This work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Platform (HIP). It was further supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center as well as by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871 and NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH.