Most practical systems must be evaluated on all sorts of benchmarks, and a multitude of different metrics must be computed in order to make the decision upon acceptance of the system as functional or faulty. The same applies to deep learning models, competing against each other on carefully chosen benchmarks. Nevertheless, none of those common benchmarks for deep learning, in general, and semantic segmentation, in particular, consider to go deeper into the numbers and look at the given problem from a different angle. For example, none of them will provide you with an understanding whether your car detector completely fails at recognising all cars, or only small cars, or only cars located near buses. Given suitable data, it is possible to answer such questions, and here we show the value of doing that by highlighting the failure modes of the chosen networks. Failure modes are an essential part of our (human) learning process, and thus it motivates us to analyse failure modes of modern semantic segmentation models in a more detailed way.
We have chosen semantic segmentation as it is a critical component of scene understanding, and already finds its niche in many sorts of applications, ranging from driverless cars  to medical imaging . Besides that, there already exists several excellent diagnostics works in other domains [12, 10, 19], which motivates us to extend them for semantic segmentation. Our aim here is to encourage researchers and practitioners to look at semantic segmentation performance from all sorts of different angles, ranging from the connections between object size and per-pixel accuracy to different notions of uncertainty and error taxonomy.
In particular, we consider two state-of-the-art models on two standard datasets for semantic segmentation, namely, PASCAL VOC , and CityScapes , where performance levels on generic benchmarks, such as intersection-over-union and per-pixel accuracy, have already reached a very high bar. Besides that, both of them provide per-instance annotations, which gives us an opportunity to reason about model performance in terms of object characteristics. Each model on each dataset we describe with regard to its sensitivity to object size and aspect ratio, error taxonomy, uncertainty levels and their correlation with performance. Finally, for each case discussed, we showcase simple extensions and provide our recommendations about possible future research directions.
Our methodology is general and flexible, making it straightforward to retrieve the same characteristics for all sorts of models and datasets.
2 Related Work
2.1 Semantic Segmentation
Semantic segmentation is the task where one is asked to predict a semantic label per each pixel in the image. Although similar in nature to image classification, it comprises several difficulties - one of which is dealing with variable input and output image sizes. First approaches employed sliding window methodology on fixed-size inputs [21, 26] until Long et al. 
proposed a fully convolutional variant of image classifiers. This became the standard choice of solving any per-pixel tasks, and semantic segmentation, in particular, has witnessed a significant progress partially due to the development of end-to-end probabilistic graphical models[28, 16], and partially due to the advances in different structures and contextual modules [24, 27, 3].
Here, we consider two state-of-the-art networks, DeepLab-v3 , and ResNet-38 , on two popular benchmarks, PASCAL VOC , suitable for general segmentation, and CityScapes , for more specific driving applications. DeepLab-v3 is a successor of the original DeepLab network  with the inclusion of encoder and a contextual module, containing dilated convolutions. ResNet-38 was proposed as an alternative to the original residual networks  after a careful analysis conducted to evaluate the trade-off between depth and width of a convolutional network.
The authors provide the models not pre-trained on the validation sets, on which these networks show comparable results, allowing us to make a fair comparison. We should also note that in this work we do not aim to attribute a particular rise in performance to any of structural and architectural advances, rather our goal is to underline similarities, strong and weak spots across different models.
2.2 Diagnostics of computer vision methods
The role of diagnostics in computer vision has often been overlooked in recent years. Nevertheless, some pivotal works cannot go unnoticed.
For object detection, Hoiem et al.  compared two state-of-the-art models by analysing their false positives and false negatives on the PASCAL VOC dataset . The analysis was based on different properties of the object instances, such as occlusion, truncation, visibility and size. From it, the authors were able to pinpoint strengths and weaknesses of the methods, in particular, the sensitivity to large and small objects, and different levels of occlusion. We are primarily motivated by this work, and aim to extend it for semantic segmentation, but we consider per-instance annotations only as other properties are not well-annotated.
Later, Hariharan et al.  conducted a similar analysis for object detection by analysing error modes and sources of false positives. They concluded that mislocalisation was the single most influential source of errors for object detectors.
Most recently, Ronchi & Perona 
built a diagnostics framework for multi-instance pose estimation. Specifically, they proposed a taxonomy of false positive localisation errors, which includes‘miss’, ‘swap’, ‘jitter’ and ‘inversion’. Based on the proposed taxonomy, they evaluated two state-of-the-art models and underlined which error modes were of the most influence. They concluded that, besides missing keypoints, those models were also suffering from noise in confidence scores, which negatively affected their performance.
On a related note, there have been multiple attempts proposing the most suitable set of evaluation metrics for semantic segmentation[8, 23, 22]. For example, Csurka et al.  argue that dataset-level metrics are less meaningful than image-level ones, as the latter allow to statistically quantify differences between two methods and better analyse their performance. We partially follow this approach and, besides reporting global per-pixel accuracy and intersection-over-union, we also report per-instance accuracy, which enables us to reason about performance across different instance properties, such as size and aspect ratio.
3.1 Object Characteristics
As noted by both Hoiem et al.  and Harihan et al. , object characteristics tend to have a large impact on the model performance. Motivated by this, we first consider how sensitive each network is to size and aspect ratio of object instances. As we have access to instance annotations, we follow Hoiem et al.  and divide the instances of each class into one of five size categories based on the number of pixels that each instance has - extra-small (XS: bottom ), small (S: -), medium (M: -), large (L: -) and extra-large (XL: top ). Analogously, we divide instances into five aspect ratio () categories - extra-tall (XT), tall (T), medium (M), wide (W), extra-wide (XW). When considered together, these definitions cover a broad spectrum of different instances - from sparse but spacious to dense but tiny. We demonstrate examples on Fig. 1.
Using such groupings, we are able to quantify model performance in terms of size and aspect ratio of each class instance.
3.2 Error Taxonomy
We move on to describe different sources of errors, namely, mislocalisation - when the predicted label is incorrect, but it does exist in the ground truth mask in close vicinity (here we consider a square patch centered at the prediction point); and confusion with other labels. Confusion can be of three different types: confusion with semantically similar classes - when the predicted label shares a subclass with the ground truth label; confusion with background (in case such a label exists) - when the predicted label is background, but the ground truth one is not; and confusion with semantically dissimilar classes - in all other cases. For grouping of semantically similar classes on PASCAL VOC, we follow Hoiem et al. , whereas for CityScapes we make use of the provided hierarchy (Table 1).
Such an error taxonomy enables us to evaluate the impact of each type of error separately, potentially leading to a specific algorithm built to alleviate the effect of each group.
|Animals+Human||cat, cow, dog, horse, person, sheep|
|Furniture||chair, sofa, table|
|Vehicles||bicycle, boat, bus, car, mbike, train|
|Construction||building, wall, fence|
|Object||pole, traffic light, traffic sign|
|Vehicles||car, truck, bus, train, motorcycle, bicycle|
3.3 Quantifying Uncertainty
Uncertainty is an important part of any functional system, and knowing sources of it might shed light on system’s behaviour.
Here, we exploit the softmax approximation to acquire per-pixel probabilities of each class from the model’s outputs, and, based on it, define two notions of uncertainty - per-pixel relative entropy and relative probability difference between top-and top- predicted classes:
where is the set of semantic classes, and is the predicted probability of class at the given pixel. Both measures range from to , where the higher values denote the higher uncertainty of the model in its own predictions. By tying up such means of uncertainty with the error types and object characteristics defined above, we are able to answer the following sorts of questions: if the instance is undetected (confused with background), how much does its uncertainty deviates from the average one? Or how does uncertainty differ across objects of various sizes?
We consider two datasets - PASCAL VOC  and CityScapes , and two networks - DeepLab-v3111https://github.com/tensorflow/models/tree/master/research/deeplab  and ResNet-38222https://github.com/itijyou/ademxapp . PASCAL VOC contains a wide spectrum of semantic classes (plus additional ‘background’ label), and has images for validation, whereas CityScapes includes validation images annotated with semantic classes. While VOC provides instance-level annotations for all the classes, CityScapes has them present only for classes - ‘bicycle’, ‘bus’, ‘car’, ‘motorcycle’, ‘person’, ‘train’, ‘truck’ and ‘rider’. We do not alter or fine-tune the provided weights in any way, and only amend the pre-processing steps to not include any rescaling of the input image and the post-processing step to only include bicubic upsampling of the score maps to the original size. In the interests of brevity, for each dataset and each defined terminology, we only discuss most interesting results on a subset of classes; we provide results for all the classes in our supplementary material333https://cv-conf.shinyapps.io/diag-sem-segm/.
We report two well-established types of quantitative measures for semantic segmentation - pixel accuracy () and intersection-over-union (), where is the number of pixels belonging to class while being predicted as class and is the total number of pixels belonging to class . When possible, we report average pixel accuracy across instances (as opposed to global values across the whole validation set). Results of the models across the validation sets are given in Table 2 for VOC and in Table 3 for CityScapes, respectively.
4.1 Object Characteristics
We examine classes ‘bottle’, ‘car’, ‘person’ and ‘sofa’ (VOC), and ‘motorcycle’, ‘rider’, ‘train’ and ‘truck’ (CityScapes), as these classes illustrate well the behaviour of model performance in terms of object characteristics among all the classes.
Both models exhibit similar behaviour, under-performing on categories of small sizes and extreme aspect ratios, and preferring as larger instances as possible without significant variation in aspect ratio (Fig. 2). Nevertheless, there are some differences: e.g., DeepLab-v3 completely misses extra-small bottles and does steadily losses several points against ResNet-38 on all other classes with S or XS instances (except for small sofas) on VOC (Fig. (a)a). Even though DeepLab-v3 steadily outperforms ResNet-38 across most categories in terms of globally computed metrics (Tables 2 and 3), it does so mostly on average instances. Surprisingly, ResNet-38 seems to perform poorly on all trains except the wide ones (Fig. (b)b).
Sensitivity of DeepLab-v3 and ResNet-38 to instance size on PASCAL VOC (a) and aspect ratio on CityScapes (b). Black diamond points with numbers represent average per-instance accuracy of given class and category; red lines indicate standard error bars, where black dashed lines denote average per-instance accuracy of the class (across all categories)
Similar performance of two different models urges us to look at the distribution of different categories as present in the training set. As both of the models were using weights pre-trained from ImageNet, and later fine-tuned for semantic segmentation on MS COCO  with possibly different choices of training data, we only consider the training set of PASCAL VOC augmented with annotations from BSD  (in total, images) in case of VOC, and the training set of CityScapes with images. Comparing the performance results with the distribution across categories (Fig. 3), we note that there does seem to be the lack of extra-small instances (although not for the problematic class ‘bottle’), along with the shortage of instances with extreme aspect ratio (pointing to the similarity with the validation set).
In an attempt to improve performance on instances of small and extra-small sizes, we conduct a simple experiment on classes ‘car’ and ‘bottle’ from VOC using DeepLab-v3. Concretely, we select all the images from the validation set that contain only instance of each of those classes belonging to either small (S) or extra-small (XS) size categories. In total, we found such images - with cars (S and XS) and with bottles (S and XS). We propagate each image through the DeepLab-v3 network, and for each we consider a score map corresponding to the class present (i.e., either ‘car’ or ‘bottle’). Inside the score map we find the point with the highest activation score, and do a square crop (of size ) around that point in the original image. We perform bicubic upsampling of the crop and propagate it through the network. Afterwards, we replace original predictions inside the cropped region with the new scores, and take the index of the highest class as the predicted label. This significantly improves the performance on small and extra small instances (Table 4), finding objects that were treated as background during the first forward pass (Fig. 4).
Alternatively, to provide a baseline for our simple approach, we consider the case when there exists access to ground truth annotations. Under this scenario, we crop the image around bounding box corresponding to the ground truth segmentation enlarged by pixels in each direction, and upsample the crop
using bicubic interpolation before feeding it into the network. As expected, this does further improve performance (Table4), though it is very close to the one achieved without knowledge of ground truth.
|Method||Class||Ins.-Wise Acc.,%||Total, %|
|Crop around max. act.||Bottle||32.61||41.51||34.81||39.67|
|Crop around GT bbox||Bottle||52.76||56.63||51.28||57.20|
Failure to recognise objects with extreme characteristics is prevalent across different domains, including object detection [12, 10]. As shown by Li et al. , this is often due to feature mismatch between small and large instances. To alleviate such an imbalance, they trained a generative adversarial network so that the network would mimic the features of large objects on the small ones, effectively fooling the detector to recognise the small object as if it was large. The extension of this approach can easily be adapted for semantic segmentation. Additionally, as shown above, attention-based post-processing transformations may reduce the need to train a separate model to deal with peculiar object instances, and can be extremely helpful.
4.2 Error Taxonomy
Here we present the results for classes ‘bicycle’, ‘chair’, ‘dog’ and ‘sofa’ (VOC), and ‘bus’, ‘motorcycle’, ‘road’ and ‘traffic light’ (CityScapes), as these classes demonstrate well connections between model performance and different types of errors among all the classes.
If the model is confused, it is more likely to be confusion with background, if such a class exists, or with semantically dissimilar classes (Fig. 5). Class ‘bicycle‘ is rarely confused with other vehicles, while ‘chair’ and ‘sofa’ are often confused with each other (Fig. (a)a). For CityScapes, the situation is similar, although not for ‘road’, where the proportion of errors is practically equal (Fig. (b)b). In some applications, one can treat predictions of semantically similar classes as belonging to a single class, and if we were to follow this approach, we would witness significant gains across the chosen classes (Fig. 6). As evident from Fig. (b)b, class ‘road’ experiences a smaller performance improvement even with a large proportion of errors caused by confusion with semantically similar classes. This happens as performance on this class is already high - almost (Table 3), and the number of errors to be corrected is small.
We further look at how mislocalisation errors affect the models. For VOC, we consider square crops centered at the point of prediction with half-side lengths being equal to , , , and pixels; for CityScapes that contains high-resolution images those values are , , , and pixels, respectively. If the predicted label does exist inside the square crop of the ground truth map, then the prediction is considered to be correct. From Fig. 7, we can easily notice that for all the chosen classes across the datasets even the window with the smallest size gives rise to a significant boost - from around to even more than , for both accuracy and IoU. This is tangible, and in situations where time for additional post-processing might be available, it can be exploited to correct mislocalisation errors, for example, by analogy to a simple zoom-in strategy outlined in Sect. 4.1.2.
Apart from choosing to sacrifice any differentiation between semantically similar classes, or to lose exact per-pixel classification, we propose another straightforward approach able to eliminate both types of errors. In particular, we consider the effect of selecting top- predictions at each pixel and looking at whether any of top- predicted labels is the ground truth one. For practical applications, this can make a huge difference, effectively reducing the search space to only few labels and treating additional post-processing steps as simply binary or ternary classification. For example, in medical imaging, a segmentation model might be initially solving the labelling problem with more than classes and solely relying on one class prediction with somewhat unstable performance, might not be the best strategy available. In contrast, as evident from Table 5, even considering top- classes with highest scores does significantly boost the numbers, possibly leading to better performance with auxiliary post-processing.
Confusion with background and with dissimilar classes tends to occupy the largest portion of errors committed by the model. To eliminate the effect of background, one might consider to divide it into more classes, effectively providing more information for the model to learn from. E.g., Hu et al.  proposed to exploit a transfer function to acquire semantic segmentation for
classes having only bounding box annotations. Furthermore, considering structured loss functions can help in alleviating the effect of the errors.
First of all we note that for the uncertainty experiments we make comparisons within the model itself, not between the models, as the scale factor of logits in all the models is different causing the softmax probabilities and, consequently, relative entropy and relative probability to be different, as well. We present the results for classes ‘bird’, ‘cat’, ‘chair’, ‘sheep’ (VOC), and ‘bicycle’, ‘bus’, ‘person’, ‘train’ (CityScapes), as these classes showcase diverse patterns prevalent among all the classes.
Relative entropy on the validation set of PASCAL VOC (a) and relative probability on the validation set of CityScapes (b) as function of distance to the boundary (in pixels). The lower line of the box denotes the lower quartile (), the black line inside the box depicts the median value, and the upper line of the box shows the upper quartile ()
We first look at how the defined notions of uncertainty behave on instances of different sizes and aspect ratios (Fig. 8). The models tend to be less certain about smaller objects with extreme aspect ratio, which is inversely proportional to the behaviour of accuracy against object characteristics (Fig. 2). The tendency to be more certain on average about larger objects can be explained by the distance to the boundary from the point of prediction - for larger objects, this distance tends to be bigger, which, in turn, make uncertainty smaller (Fig. 9).
Additionally, we consider how uncertainty differs across the range of different types of errors, as well as the average uncertainty per instance. As evident from Fig. 10, even when the models commit errors, their uncertainty might signal about the error, which may be helpful in lots of scenarios. In particular, relative probability tends to be the highest at the points with mislocalisation errors (of radius for PASCAL VOC and for CityScapes), followed by confusion with semantically similar classes (Fig. (a)a), signalling that top- scores are practically equal. In contrast, relative entropy seems to be the highest on semantically similar classes closely followed by mislocalisation errors (Fig. (b)b), indicating that there is no clear winner class amongst predictions.
As uncertainty seems to take higher values when the model commits an error, here we take a closer look at the ability of uncertainty to differentiate between foreground and background on PASCAL VOC. To this end, we consider relative entropy and relative probability computed on images from the validation set using ResNet-38. We treat pixels with uncertainty higher than the image average as ‘foreground’ pixels, and all others as ‘background’. We compare the resultant masks against ground truth segmentations, and demonstrate our results in Table 6. Both uncertainty based foreground-background predictors achieve solid accuracy, but the method using relative entropy suffers from a large number of false positives, as evident from precision, while the one with relative probability has a lower recall, signalling about a large number of undetected foreground pixels. Our simplistic way of thresholding is, of course, a subject of further improvements.
Exploiting uncertainty is becoming a topic of its own in deep learning , and we encourage practitioners and researchers to be aware of it. For semantic segmentation, additional post-processing techniques based on uncertainty measures may alleviate certain types of errors and might signal about a missing object, or even about a new unseen class.
5 Discussion & Conclusions
In this paper, we approached the question of diagnostics in semantic segmentation. This is an extremely broad area of research, and we believe that for further advances in the field we will need to get a better grasp on the current advances that we have. To this end, we laid out the extensive (but by no means the exclusive) categorisation of most prevalent sources of errors in semantic segmentation, along with novel points considering two types of uncertainty, as well as simple extensions. Our findings signal that the performance of semantic segmentation models has indeed reached high levels, and future advances should be concerned with how to unite different types of annotated data instead of pursuing expensive per-pixel labellings; how to exacerbate the effect of particular error sources; and how to make use of uncertainty to improve the stability of the model. We hope that this report will provide inspiration for a broader research into the question of how exactly segmentation models achieve such extraordinary results, as well as will bring more advances into the area.
Besides the above findings, we believe that an efficient usage of the models that we have now (i.e., transfer learning) must be explored further along with the notions of uncertainty for learning new objects and classes. We aim to pursue and address those directions in our future research.
Berman, M., Rannen Triki, A., Blaschko, M.B.: The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR (2018)
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR abs/1606.00915 (2016)
-  Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR abs/1802.02611 (2018)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
-  Csurka, G., Larlus, D., Perronnin, F.: What is a good evaluation measure for semantic segmentation? In: BMVC (2013)
-  Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
-  Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
-  Freixenet, J., Muñoz, X., Raba, D., Martí, J., Cufí, X.: Yet another survey on image segmentation: Region and boundary information integration. In: ECCV (2002)
-  Hariharan, B., Arbelaez, P., Bourdev, L.D., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)
-  Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J.: Simultaneous detection and segmentation. In: ECCV (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
-  Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: ECCV (2012)
-  Hu, R., Dollár, P., He, K., Darrell, T., Girshick, R.B.: Learning to segment every thing. CoRR abs/1711.10370 (2017)
-  Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: NIPS (2017)
-  Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. In: CVPR (2017)
-  Lin, G., Shen, C., van den Hengel, A., Reid, I.D.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR (2016)
-  Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: ECCV (2014)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
-  Ronchi, M.R., Perona, P.: Benchmarking and error diagnosis in multi-instance pose estimation. CoRR abs/1707.05388 (2017)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
Schroff, F., Criminisi, A., Zisserman, A.: Object class segmentation using random forests. In: BMVC (2008)
-  Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: ECCV (2006)
-  Unnikrishnan, R., Pantofaru, C., Hebert, M.: A measure for objective evaluation of image segmentation algorithms. In: CVPR (2005)
-  Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: Revisiting the resnet model for visual recognition. CoRR abs/1611.10080 (2016)
-  Xu, H., Gao, Y., Yu, F., Darrell, T.: End-to-end learning of driving models from large-scale video datasets. In: CVPR (2017)
-  Yin, Z., Bise, R., Chen, M., Kanade, T.: Cell segmentation in microscopy imagery using a bag of local bayesian classifiers. In: ISBI (2010)
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
-  Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.S.: Iccv (2015)