Image segmentation describes the demanding task of simultaneously performing object recognition and boundary segmentation and is one of the oldest problems in computer vision [18, Ch. 5]. It is also often a crucial part of many visual understanding systems.
Recent advances in deep learning have fundamentally changed the field and brought remarkable performance improvements. However, training these models requires highly accurate labeled data in sufficiently large numbers. The goal of our method, depicted in Figure 1, is to circumvent this cumbersome labeling effort by going some extra miles during inference. To do so, we took inspiration from the field of Explainable AI (XAI), in particular from Layer-wise Relevance Propagation (LRP), first presented in [1]. LRP is usually used to highlight the pixels that contribute to the decision of a classification network and to gain further insights into the decision-making process. In this initial work, we focus on binary semantic segmentation. We train a VGG-A network architecture to assign an input image to one of two classes. Afterwards, we use LRP to highlight the pixels contributing to the decision of the network and investigate three segmentation techniques to generate the final segmentation mask. This idea enables weakly supervised training of a segmentation method: it needs only image-wise labeled data to train a classification network, but outputs a pixel-wise segmentation mask. Further, we show that our approach yields results comparable to dedicated segmentation networks, but without the cumbersome requirement of pixel-wise labeled ground-truth data for training. We put our approach to the test on two different datasets; example images can be found in Figure 2 and Figure 3.
In the remainder of the paper, we summarize related publications in section 2, followed by an explanation of our system in section 3. In section 4, we evaluate our approach against a classical semantic segmentation network and discuss several design options and their performance impacts.
2 Related Work
In the past decades, generations of researchers have developed image segmentation algorithms and the literature divides them into three related problems. Semantic segmentation describes the task of assigning a class label to every pixel in the image. With instance segmentation, every pixel gets assigned to a class, along with an object identification. Thus, it separates different instances of the same class. Panoptic segmentation performs both tasks concurrently, producing a semantic segmentation for “stuff” (e.g. background, sky) and “things” (objects like house, cars, persons).
There are classical approaches like active contours [5], watershed methods [20] and graph-based segmentation [3], to name only a few. A good overview of these classical approaches can be found in [18, Ch. 5]. With the advance of deep learning techniques in the area of image segmentation, new levels of robustness and accuracy can be reached. A comprehensive recent review of deep learning techniques for image segmentation can be found in [9]. This review covers different architectures, training strategies and loss functions, so we refer the interested reader there for a current overview. All these approaches share one drawback: they need pixel-wise annotated images for training.
In the field of Explainable AI (XAI), several algorithms for the explanation of network decisions have been proposed. The authors of [17] proposed a method called Integrated Gradients, where gradients are calculated for linear interpolations between a baseline (for example, a black image) and the actual input image. By averaging over these gradients, the pixels with the strongest effect on the model’s decision are highlighted. SmoothGrad [16] generates a sensitivity map by averaging over several instances of one input image, each one augmented by added noise. This way, smoother sensitivity maps can be generated. There are many more methods and a comprehensive review would be out of scope for this paper, so we refer the interested reader to [8, 2, 13]. Among these methods, we selected LRP, as it results in comparatively sharp heatmaps and is therefore the best starting point for the generation of segmentation maps. Furthermore, the authors of [14] showed that, with a simple extension, LRP can be utilized to localize traces of forgery in morphed face images.
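The two gradient-based methods summarized above can be illustrated on a toy problem. The following is a minimal sketch, not the reference implementation of either method: the scalar "model" `f`, its analytic gradient `grad_f`, and all parameter values (`steps`, `sigma`, `n_samples`) are assumptions chosen for illustration. In practice, the gradient would come from automatic differentiation of a trained network.

```python
import numpy as np

def f(x):
    """Toy scalar 'model' (stand-in for a class score): sum of squares."""
    return float(np.sum(x ** 2))

def grad_f(x):
    """Analytic gradient of the toy model."""
    return 2.0 * x

def integrated_gradients(x, baseline, grad_fn, steps=50):
    # Average the gradient over linear interpolations between baseline and input
    # (midpoint Riemann sum), then scale by the difference to the baseline.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

def smoothgrad(x, grad_fn, sigma=0.1, n_samples=1000, seed=0):
    # Average the gradient over several noisy copies of the same input.
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return np.stack([grad_fn(sample) for sample in noisy]).mean(axis=0)

x = np.array([1.0, -2.0, 0.5])
ig = integrated_gradients(x, np.zeros_like(x), grad_f)
sg = smoothgrad(x, grad_f)
```

For this toy model, the Integrated Gradients attributions sum to `f(x) - f(baseline)`, which is a convenient sanity check when experimenting with the step count.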
Massive amounts of pixel-wise labeled segmentation masks are usually the training foundation of deep neural networks (DNNs) for image segmentation tasks. Obtaining them is usually a tedious manual process that introduces problems of its own, because of the variance in the annotations and the often fuzzy definitions of object boundaries. Therefore, we take an indirect route. In our approach, we train a classification network instead of a segmentation network, which greatly reduces the annotation work and removes the requirement of an exact definition of object boundaries. To segment an image, we first pass it through the classification network, see section 3.2.1, which outputs whether the object we want to segment is in the image or not. If it is, we pass through the classification network a second time, but this time from the back to the front, using the LRP technique [1] described in section 3.3, an XAI technique that highlights the pixels contributing to the DNN’s decision in a heatmap. We use this heatmap to generate a segmentation mask without the cumbersome task of manual pixel-wise labeling.
3.2 Network Architectures
For binary classification, we resort to the classical VGG-11 (VGG-A) architecture without batch normalization, as described in [15], but with only 128 neurons in each of the fully connected layers. We use this architecture because it is readily available, well understood and a good starting point for the use of the LRP framework [1]. To further improve the segmentation results generated with LRP, we also adapt the VGG-11 architecture by connecting the outputs of the last max-pooling layer directly to the two output neurons. This benefits the classification accuracy as well as the indirect segmentation (see section 4). For both configurations, we start the training with pretrained weights for the convolutional layers and randomly initialized weights for the fully connected layers. We use a learning rate of 0.001 for the fully connected layers and 0.0001 for the refined convolutional layers.
3.2.2 Native Segmentation
To compare our proposed method against an established network architecture for image segmentation, we have chosen the U-Net architecture as described in [12], as it was developed especially for very small datasets (the authors of the original work used only 30 training samples). We train the network, as described in the original work, in a classical supervised way using data augmentation techniques like affine transformations and random elastic deformations to cope with the small training datasets of 60 and 119 samples for the sewer pipes and the magnetic tiles, respectively.
Layer-wise Relevance Propagation (LRP) [1] is an interpretability method for DNNs. It was designed to highlight, on a pixel level, the structures in an image that are relevant for the DNN’s decision. To this end, it assigns to each input neuron of a DNN, e.g. each pixel of an image, a relevance score that reflects its impact on the activation of a class of interest. A positive relevance score denotes a contribution to the activation, while a negative relevance score denotes an inhibition.
In a first step, LRP assigns a relevance value to a starting neuron that represents the class of interest. Given this initialization, LRP propagates this starting relevance backwards through the DNN into the input image. To this end, it iterates from the last layer to the input image through all layers. In each iteration step, it assigns relevance scores to the neurons in the current layer based on their activations, their weights to the neurons in the subsequent layer, and the relevance scores of those neurons. If a neuron contributes to the activation of a neuron in the subsequent layer, it receives a share of that neuron’s relevance. If it inhibits the activation, it gets a negative share of the relevance. LRP can make use of different rules for the propagation of relevance from the neurons in one layer to the neurons in its predecessor. The rules define how LRP propagates the relevance in every single step and consider various properties of the activations and connections in different parts of the DNN. [6] shows that the most accurate and understandable relevance distributions can be achieved by using different rules for different parts of a DNN.
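The backward redistribution described above can be sketched for a tiny fully connected ReLU network. This is a minimal illustration of the basic LRP-0 rule (with an optional epsilon stabilizer), not the composite rule sets used in our experiments; the network weights, the input, the absence of bias terms and the helper name `lrp_dense` are all assumptions made for the example.

```python
import numpy as np

def lrp_dense(activations, weights, relevance_out, eps=0.0):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    eps = 0 gives the basic LRP-0 rule; eps > 0 gives the LRP-epsilon rule.
    """
    z = activations @ weights                     # summed contributions per output neuron
    z = z + eps * np.where(z >= 0.0, 1.0, -1.0)   # stabilizer carrying the sign of z
    s = relevance_out / z                         # relevance per unit of contribution
    return activations * (weights @ s)            # share received by each input neuron

# Hypothetical toy network without biases: 3 inputs -> 2 hidden (ReLU) -> 1 output.
x = np.array([1.0, 2.0, 0.5])
W1 = np.array([[0.5, -1.0], [0.25, 0.5], [-1.0, 1.0]])
W2 = np.array([[1.0], [-0.5]])

h = np.maximum(x @ W1, 0.0)   # forward pass, hidden activations
out = h @ W2                  # score of the class of interest

R_out = out.copy()            # start relevance: the output score itself
R_h = lrp_dense(h, W2, R_out) # relevance of the hidden neurons
R_x = lrp_dense(x, W1, R_h)   # relevance of the input values ("pixels")
```

In this example the second hidden neuron inhibits the output and therefore receives negative relevance, and the relevance is conserved from layer to layer (`R_x.sum()` equals the output score), which holds for LRP-0 in a bias-free network.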
|Rule names||LRP Ours||LRP Montavon|
In our experiments, we use two different sets of LRP rules. An overview is given in Table 1. The parameters for the rules are and . In both cases, we use the LRP- rule for the first layer, which maps the relevance into the image. The LRP- rule considers that it has to propagate the relevance to pixels that contain real values and not ReLU activations like the neurons in the DNN. The first combination of rules and parameters has been shown to be suitable for VGG-like architectures [10]. While the LRP-0 rule is close to the activation function of the network, the LRP-$\epsilon$ rule focuses on more salient features and the LRP-$\gamma$ rule is most suitable to spread the relevance to the whole feature. Empirically, we found a suitable second combination of rules, which we included in our experiments. It uses the LRP-$\epsilon$ rule for the fully connected layers to focus on more salient features already at this stage. The use of the LRP-$\gamma$ rule for the middle layers enforces an early spread of the relevance to the whole feature. The LRP-$\alpha$-$\beta$ rule with in the lower layers considers inhibiting and contributing features differently and puts a strong focus on contributing activations. This rule leads to more balanced relevance maps with strong focus on inhibiting regions. For further details on LRP and the characteristics of its different relevance propagation rules, we refer to [10].
3.3.2 Segmentation from LRP
LRP assigns to each pixel a relevance score with values within an arbitrary interval. In order to transfer these relevance distributions to a segmentation map, we developed three different approaches. The first one, Simple, is based on the simple automatic thresholding algorithm described in [19]. The other two approaches, GMM and BMM, are based on mixture models and are optimized using an Expectation-Maximization (EM) algorithm.
To calculate the segmentation of foreground and background, the LRP activations are normalized first, and their mean over all pixels defines the initial threshold. Based on this threshold, the image is separated into foreground and background, and the mean over all values in each of the two classes is calculated. The new threshold is the average of both mean values. This iterative refinement stops when the threshold value converges.
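The iterative refinement described above can be sketched as follows. This is a minimal sketch under stated assumptions: the normalization is taken to be a min-max mapping to [0, 1], and the convergence tolerance `tol`, the iteration cap `max_iter` and the function name `iterative_threshold` are hypothetical choices, not values taken from our experiments.

```python
import numpy as np

def iterative_threshold(relevance, tol=1e-6, max_iter=100):
    """Return (threshold, mask) from iterative mean-based threshold refinement."""
    r = np.asarray(relevance, dtype=float)
    span = r.max() - r.min()
    r = (r - r.min()) / span if span > 0 else np.zeros_like(r)  # assumed min-max normalization
    t = r.mean()                                  # initial threshold: global mean
    for _ in range(max_iter):
        foreground, background = r[r > t], r[r <= t]
        if foreground.size == 0 or background.size == 0:
            break                                 # degenerate split, keep current threshold
        t_new = 0.5 * (foreground.mean() + background.mean())
        converged = abs(t_new - t) < tol
        t = t_new
        if converged:
            break
    return t, r > t

threshold, mask = iterative_threshold([0.0, 0.1, 0.2, 0.9, 1.0])
```

For the five example values, the threshold moves from the global mean to the midpoint between the two class means and separates the two largest values from the rest.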
This segmentation method is based on a Gaussian Mixture Model (GMM) to distinguish between relevant regions (damages) and background. In a first step, we apply a 2D mean filter of size five by five to the relevance map. This smooths out extreme relevance peaks at single pixels and makes the relevance distribution in the inner part of a damage smoother. Our GMM has three components and is fit to the relevance distribution considering only the 1-D relevance scores and no spatial information. We used Python’s scikit-learn package [11] to fit the GMM to the data. We initialize the GMM using the k-Means algorithm to obtain a first assignment of the samples to the components and thus to initialize the parameters. To identify the component of the GMM that describes the relevance values covering the damages, we select the component with the largest likelihood for the maximal relevance value. The final segmentation map is calculated by selecting all pixels that belong to this component with a likelihood of more than 50%.
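A minimal sketch of this pipeline with scikit-learn and SciPy could look as follows. It is an illustration under assumptions rather than our exact implementation: the function name `gmm_segment`, the random seed, the synthetic relevance map, and the reading of "likelihood" as the posterior probability returned by `predict_proba` are choices made for this example.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.mixture import GaussianMixture

def gmm_segment(relevance_map, n_components=3, seed=0):
    """Segment a relevance map with a 1-D GMM; returns (mask, posterior map)."""
    # 5x5 mean filter to suppress single-pixel relevance peaks.
    smoothed = uniform_filter(np.asarray(relevance_map, dtype=float), size=5)
    samples = smoothed.reshape(-1, 1)             # 1-D scores, no spatial information
    gmm = GaussianMixture(n_components=n_components,
                          init_params="kmeans", random_state=seed)
    gmm.fit(samples)
    # Damage component: the one most responsible for the maximal relevance value.
    damage = int(np.argmax(gmm.predict_proba([[smoothed.max()]])[0]))
    posterior = gmm.predict_proba(samples)[:, damage].reshape(smoothed.shape)
    return posterior > 0.5, posterior

# Synthetic relevance map (assumption): weak noise plus one bright square "damage".
rng = np.random.default_rng(0)
relevance = rng.normal(0.0, 0.01, size=(40, 40))
relevance[15:25, 15:25] += 1.0
mask, posterior = gmm_segment(relevance)
```

The returned posterior map is what a precision-recall analysis would threshold at values other than 0.5.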
Our Beta Mixture Model (BMM) for segmentation consists of two Beta distributions. The probability density function of the Beta distribution is defined in (1):

$$f(x; a, b) = \frac{1}{B(a, b)}\, x^{a-1} (1 - x)^{b-1}, \quad x \in [0, 1], \tag{1}$$

with $B(a, b)$ being the normalization factor as defined in (2):

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}, \tag{2}$$

where $\Gamma$ is the Gamma function and $a, b > 0$ are the shape parameters. The idea of this model is to use two skewed distributions to describe the relevance scores. One distribution characterizes the large amount of background pixels with low positive relevance values and the other one the areas containing damages with large relevance values. The distributions are fit using an EM algorithm with outlier removal and weights for the relevance distributions. The details are described in the following.
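To make the role of the two skewed components concrete, the Beta density with its Gamma-function normalization can be evaluated directly. The following is a small sketch using only the Python standard library; the shape parameters of the `background` and `damage` components and the mixture weight are hypothetical values chosen for illustration, not fitted parameters from our experiments.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta distribution on [0, 1] with shape parameters a, b > 0."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # normalization factor
    return x ** (a - 1.0) * (1.0 - x) ** (b - 1.0) / norm

def mixture_pdf(x, weight_bg, bg, dmg):
    """Two-component Beta mixture: background plus damage."""
    return weight_bg * beta_pdf(x, *bg) + (1.0 - weight_bg) * beta_pdf(x, *dmg)

background = (1.5, 8.0)   # hypothetical: most mass at low relevance values
damage = (6.0, 2.0)       # hypothetical: most mass at high relevance values

low = (beta_pdf(0.1, *background), beta_pdf(0.1, *damage))
high = (beta_pdf(0.9, *background), beta_pdf(0.9, *damage))
density_mid = mixture_pdf(0.5, 0.9, background, damage)
```

With these example parameters, the background component dominates at a relevance of 0.1 and the damage component dominates at 0.9, which is the separation the mixture model relies on.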
First, the relevance maps are filtered as in our GMM segmentation approach. In a next step, we set all negative relevance values to zero and remove the 50% smallest relevance values, since the damages cover in all cases significantly fewer than 50% of the pixels in the image. Subsequently, the relevance scores are normalized to the interval [0,1], since the Beta distribution is defined only on this interval. To this end, we map the smallest value to zero and the largest to one using an affine transformation. The BMM is fit to these processed data using an EM algorithm with the following modifications in the expectation step.
Pixels with a relevance value larger than the mean of the probability density function of the component that represents the damages are assigned to this component with a probability of 100%. Pixels that lie within the lower one-sided 90% confidence interval of the component that represents the background are assigned to this component with a probability of 100%. Finally, we weight the probabilities of each component by the sum of the component’s probabilities. These modifications prevent the large number of small relevance values from the background pixels from affecting the component that describes the relevance values of the damaged areas in the maximization step, and they compensate for the large difference in the number of relevant (damage) and non-relevant (background) pixels.
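The preprocessing that precedes the EM fit (clipping negative values, discarding the smaller half of the scores and rescaling to [0, 1]) can be sketched as below. The function name `preprocess_for_bmm` is hypothetical, the input is assumed to be an already mean-filtered relevance map, and the modified EM steps themselves are not covered by this sketch.

```python
import numpy as np

def preprocess_for_bmm(filtered_relevance):
    """Clip negatives, keep the larger half of the scores, rescale them to [0, 1]."""
    r = np.clip(np.asarray(filtered_relevance, dtype=float).ravel(), 0.0, None)
    kept = np.sort(r)[r.size // 2:]          # drop the 50% smallest relevance values
    lo, hi = kept.min(), kept.max()
    if hi == lo:                             # constant input: nothing to rescale
        return np.zeros_like(kept)
    return (kept - lo) / (hi - lo)           # affine map: smallest -> 0, largest -> 1

scores = preprocess_for_bmm([-1.0, 0.0, 0.2, 0.4, 0.6, 1.0])
```

Because the Beta density can be zero or unbounded exactly at 0 and 1, an implementation may additionally want to move the rescaled values slightly into the open interval before fitting; that step is an assumption and not part of the description above.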
To showcase the applicability of our proposed method, we work with two different datasets in our experiments, which we describe in detail below.
Cracks in Sewer Pipes
Sewer pipe assessment is usually done with the help of mobile robots equipped with cameras. In our case, the robot was equipped with a fisheye lens, resulting in severe image distortions. Therefore, we preprocessed the original footage as described in [7], which consists of camera tracking, image reprojection and enhancement. For the classification of damaged and undamaged pipe surfaces, we manually cropped 628 and 754 images, respectively, with a size of 224 by 224 pixels. The damages show a huge variety in size, color and shape, as can be seen in Figure 2. During the training of the classification network, we performed no further data augmentation. For the training of a classical segmentation network, we cropped another set of images containing pipe cracks and manually created the segmentation masks. We divided the dataset into a testing and a validation dataset, each with 20% of all images, and a training set with the remaining 60% of all images. During the training of the segmentation network, we used elastic deformations and affine transformations for data augmentation [12]. The testing and validation data are augmented by adding the horizontally and vertically flipped version of each image to the corresponding set.
Cracks in Magnetic Tiles
As a second dataset, we use the magnetic tile defect dataset provided by the authors of [4]. This set contains 894 images of magnetic tiles without any damage and 190 images of magnetic tiles with either a crack or a blowhole. These damages are very small and cover only a few percent of an image. The images in this dataset are of different sizes, with widths between 103 and 618 pixels and heights between 222 and 403 pixels. We manually cropped all images with damaged magnetic tiles such that every random crop of 224 by 224 pixels contains the damage. During training, the images are randomly cropped to a size of 224 by 224 pixels, while during testing and validation we cropped the center of the images. Images with a height or width smaller than 224 pixels have been rescaled to reach the minimal size of 224 pixels. We divided the dataset into a testing and a validation dataset, each with 20% of all images, and a training set with the remaining 60% of all images. We ensured that this distribution also holds for each damage type and for the damage-free images. When splitting the images into these sets, we took into account that the authors of the dataset captured each damaged and damage-free region up to six different times, and split the images such that each region is in one set only. We augmented the data during training using random cropping and random horizontal and vertical flipping. The testing and validation data are augmented by adding the horizontally and vertically flipped version of each image to the corresponding set. Example images can be found in Figure 3.
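Keeping all captures of one region in a single set amounts to splitting on the level of regions rather than images. The following is a minimal sketch of such a group-aware 60/20/20 split using only the standard library; the mapping `image_to_region`, the file names, and the function name `split_by_region` are hypothetical, and the sketch does not reproduce the additional balancing over damage types.

```python
import random
from collections import defaultdict

def split_by_region(image_to_region, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split images into (train, val, test) so that no region spans two sets."""
    groups = defaultdict(list)
    for image, region in image_to_region.items():
        groups[region].append(image)
    regions = sorted(groups)
    random.Random(seed).shuffle(regions)      # reproducible shuffle of region ids
    n_train = round(fractions[0] * len(regions))
    n_val = round(fractions[1] * len(regions))
    parts = (regions[:n_train],
             regions[n_train:n_train + n_val],
             regions[n_train + n_val:])
    return tuple([img for region in part for img in groups[region]] for part in parts)

# Hypothetical example: 10 regions, each captured 3 times.
image_to_region = {f"img_{r}_{k}.png": r for r in range(10) for k in range(3)}
train, val, test = split_by_region(image_to_region)
```

Note that the fractions are enforced on the number of regions; when regions have different numbers of captures, the image-level proportions only approximate 60/20/20.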
We tested the proposed LRP-based segmentation methods on the Sewer Pipe Cracks and the Magnetic Tile Cracks datasets and compared their performance with the segmentation results from a U-Net. We evaluated all combinations of our three proposed threshold estimation methods, LRP rules and VGG-based networks to study their suitability for image segmentation. As evaluation metrics, we use Intersection over Union (IoU) as well as precision and recall, two metrics common for binary classification tasks. Since this is a binary problem, we calculated all metrics only for the pixels segmented as damage. Furthermore, we analyzed the Precision-Recall (PR) curves of the different segmentation approaches. Whereas the U-Net, the BMM approach and the GMM approach output a value for the likelihood that a pixel shows a damaged area, the simple approach outputs only a binary decision, so no PR curve can be calculated for it. In place of the simple approach, we calculated the PR curve on the LRP output after mapping it into the interval [0,1] using an affine transformation.
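For reference, the three metrics can be computed from a predicted and a ground-truth binary mask as sketched below, with "damage" as the positive class. The function name `segmentation_metrics` and the tiny example masks are assumptions for illustration; the sketch also reports the fraction of damage pixels, which is the precision a no-skill segmentation would achieve.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """IoU, precision and recall for the positive (damage) class of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))      # damage pixels found
    fp = int(np.sum(pred & ~target))     # background segmented as damage
    fn = int(np.sum(~pred & target))     # damage pixels missed
    return {
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "no_skill_precision": float(target.mean()),  # share of damage pixels
    }

metrics = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

Sweeping a threshold over a likelihood map and calling this function for each resulting mask yields the points of a precision-recall curve.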
The PR curves contain striking horizontal lines, whose origin is explained in the following. The GMM approach assigns a likelihood of one for being a damage to a large number of pixels. Two components of the GMM have a mean around zero and very small variances. The third component, which describes the damage, has a significantly larger variance and mean. Due to the narrow peaks of the first two components, which describe non-damage pixels, their probability density functions already evaluate to zero for moderate relevance scores when using 64-bit floating point numbers as defined in IEEE 754-2008. This behavior is caused, especially for the Magnetic Tile dataset, by the contrast between the large number of background pixels with relevance scores close to zero and the small number of pixels showing damaged areas with large relevance scores. It can also be observed for the Sewer Pipes dataset, but to a smaller extent. Increasing the threshold can thus neither improve the precision nor change the recall. We depict this point of saturation with a horizontal line.
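The saturation effect can be reproduced numerically with a few lines of standard-library Python. The component parameters below (weights, means and standard deviations) are hypothetical values that merely mimic the described situation of two very narrow background components and one broad damage component.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, evaluated in 64-bit floating point."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_damage(x, components):
    """Posterior of the last component; components are (weight, mean, std) tuples."""
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    return densities[-1] / sum(densities)

# Hypothetical mixture: two narrow background components, one broad damage component.
components = [(0.60, 0.0, 0.001), (0.39, 0.0, 0.002), (0.01, 0.5, 0.2)]

narrow_density = gaussian_pdf(0.1, 0.0, 0.001)   # exp(-5000) underflows to 0.0
posteriors = [posterior_damage(x, components) for x in (0.1, 0.2, 0.3)]
```

Since the background densities underflow to exactly zero, the damage posterior is exactly 1.0 for all three relevance scores, so no threshold below one can separate these pixels from each other.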
4.1 Magnetic Tiles
Both VGG-A-based networks perform well in detecting damages in all cases and yield a balanced accuracy of more than 95%, see Table 2.
|VGG-A one FC||98.5%||97.1%||99.9%|
Table 3 shows that some of our LRP-based segmentation approaches perform as well as the U-Net segmentation, which requires a pixel-wise segmentation for training. The performance of our proposed approaches differs strongly in terms of IoU, precision and recall. The GMM approach has the worst IoU and precision but by far the best recall. The recall of the simple approach is in general better than for the BMM approach, but the BMM provides a better precision. There is no model that outperforms all others in all metrics. Which approach to choose depends on whether the detector should focus on sensitivity or specificity.
The precision-recall curves in Figure 4 and Figure 5 show that the different results for the metrics in Table 3 for U-Net, BMM and GMM are not only a matter of the threshold; the approaches perform differently depending on the selected threshold. All our approaches achieve better results than a no-skill segmentation system with a precision of 0.006. The BMM approach performs better than the GMM approach in nearly all cases, except for very high recall rates. It outperforms the U-Net-based segmentation in the recall interval of roughly 0.7 to 1 in the best setting with the VGG-A 128N DNN and our proposed LRP ruleset. In the remaining range of 0 to 0.7, its precision is on average only 0.1 worse than that of the U-Net segmentation. In all cases, the GMM model very quickly reaches a point of saturation with final precision scores between about 0.15 and 0.3, depending on the used LRP ruleset and DNN model. An explanation for this saturation can be found in the beginning of section 4.
Whether our proposed LRP ruleset or the ruleset proposed in [10] is more suitable for an LRP-based segmentation depends on the approach used for the final segmentation step. While the segmentation based on raw relevance intensities and the GMM-based segmentation in general yield better results with the ruleset proposed in [10], the best results are achieved using our proposed ruleset and the BMM approach with the VGG-A 128N model.
Figure 9 depicts examples for the damage segmentation of magnetic tile images using U-Net as well as our proposed approach. The LRP results in the first two rows show a typical weakness of LRP-based segmentations. The relevance scores at the borders of small damages are very large, but inside the defect, they are small and close to zero. Thus, a non-sensitive approach does not segment the inner part of a damaged area as such. The LRP-based segmentation in the first row does not contain the complete contour of the damage, which is caused by smaller relevance values at one part of the damage’s border. This problem can be solved by adjusting the threshold for the segmentation to make the approach more sensitive. The example in the bottom row shows that our LRP-based approach is also suitable for segmenting more complex structures than the ellipsoids shown in the other three examples. In general, the LRP-based and U-Net segmentation results describe the position as well as the shape of the damage with a visually comprehensible precision.
|LRP Ours||LRP Montavon|
|VGG-A 128N||IoU||Precision||Recall||VGG-A 128N||IoU||Precision||Recall|
|VGG-A one FC||IoU||Precision||Recall||VGG-A one FC||IoU||Precision||Recall|
4.2 Sewer Pipes
The VGG-A-based classification networks yield more than 95% balanced accuracy for both network configurations. By directly connecting the convolutional layers to only two output neurons, a slight performance increase could be achieved, as can be seen in Table 4.
|VGG-A one FC||97.4%||98.5%||96.2%|
Table 5 contains the segmentation metrics for the Sewer Pipe Cracks dataset. Among our approaches, the BMM-based segmentation has the highest IoU values and also outperforms U-Net when used in conjunction with VGG-A one FC. This configuration also has the highest precision, but lower recall values. The GMM-based approach shows results similar to those of the simple thresholding algorithm.
The precision-recall curves for our LRP configuration and the one from [10] are plotted in Figure 7 and Figure 6. As can be seen from the figures, using the raw LRP output is not practical, since the segmentation is barely better than a no-skill segmentation system with a precision of 0.05379, which is just the proportion of pixels labeled as cracks (represented by the black horizontal line).
Using the proposed GMM-based and BMM-based segmentations improves the results by a great margin. However, the GMM-based approaches exhibit a straight horizontal line, for which an explanation can be found in the beginning of section 4. In the interval between a recall of roughly 0.5 and 0.7, our approach outperforms U-Net, but it falls behind for lower recall values. For recall values greater than 0.7, VGG-A one FC and U-Net show a similar performance when utilizing our LRP ruleset. The usage of LRP Montavon results in a wider margin and also causes a worse performance for the VGG-A 128N configurations, which therefore never exceed the performance of U-Net.
For a visual comparison between U-Net and our configuration with VGG-A one FC and BMM, we refer to Figure 10. As can be seen in (c), U-Net generates many true positive crack segmentations, but gets distracted by strong brightness differences, for instance in the uppermost and lowest image. Deposits depicted in the image are sometimes also mistakenly segmented as cracks, as can be seen in the second image from the bottom. The LRP configuration is more robust against these issues, but the segmentations tend to be wider than the actual cracks, especially for narrow ones.
|LRP Ours||LRP Montavon|
|VGG-A 128N||IoU||Precision||Recall||VGG-A 128N||IoU||Precision||Recall|
|VGG-A one FC||IoU||Precision||Recall||VGG-A one FC||IoU||Precision||Recall|
We presented a method that circumvents the requirement of pixel-wise labeling when training a neural network for the demanding task of image segmentation. To demonstrate the applicability of our approach, we compared different configurations against the established U-Net architecture and achieved comparable results on two different datasets. Thereby, we showed that the output of Layer-wise Relevance Propagation (LRP) can be exploited to generate pixel-wise segmentation masks.
An interesting further research direction could be the extension to multi-label segmentation. Currently, we also try to apply the proposed solution to the challenging problem of rain streak segmentation in natural images, as it is a very demanding task to generate a sufficient amount of training data and therefore is an ideal application area for our proposed method. Some early results of our work can be seen in Figure 8.
This work has partly been funded by the German Federal Ministry of Economic Affairs and Energy under grant number 01MT20001D (Gemimeg), the Berlin state ProFIT program under grant number 10174498 (BerDiBa), and the German Federal Ministry of Education and Research under grant number 13N13891 (AuZuKa).
[1] On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 10 (7), pp. e0130140.
[2] Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, pp. 82–115.
[3] (2004) Efficient Graph-Based Image Segmentation. International Journal of Computer Vision 59 (2), pp. 167–181.
[4] (2018) Surface Defect Saliency of Magnetic Tile. 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pp. 612–617.
[5] (1988) Snakes: Active contour models. International Journal of Computer Vision 1 (4), pp. 321–331.
[6] (2020) Towards best practice in explaining neural network decisions with LRP. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1–7.
[7] (2018) Automatic Analysis of Sewer Pipes Based on Unrolled Monocular Fisheye Images. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2019–2027.
[8] Explainable AI: a review of machine learning interpretability methods. Entropy 23.
[9] (2021) Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99), pp. 1–1.
[10] (2019) Layer-wise relevance propagation: an overview. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209.
[11] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
[12] (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv.
[13] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller (Eds.) (2019) Explainable AI: interpreting, explaining and visualizing deep learning. Lecture Notes in Computer Science, Vol. 11700, Springer.
[14] (2021) Feature focus: towards explainable and transparent deep face morphing attack detectors. Computers 10 (9).
[15] (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556.
[16] (2017) SmoothGrad: removing noise by adding noise. ArXiv abs/1706.03825.
[17] (2017) Axiomatic attribution for deep networks. ArXiv abs/1703.01365.
[18] (2011) Computer Vision, Algorithms and Applications. Texts in Computer Science, Springer.
[19] (2017) Digital Image Processing and Analysis with MATLAB and CVIPtools, Third Edition. CRC Press.
[20] (1991) Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (6), pp. 583–598.