Effect of Superpixel Aggregation on Explanations in LIME – A Case Study with Biological Data

10/17/2019 ∙ by Ludwig Schallner, et al.

End-to-end learning with deep neural networks, such as convolutional neural networks (CNNs), has been demonstrated to be very successful for different tasks of image classification. To make decisions of black-box approaches transparent, different solutions have been proposed. LIME is an approach to explainable AI relying on segmenting images into superpixels based on the Quick-Shift algorithm. In this paper, we present an explorative study of how different superpixel methods, namely Felzenszwalb, SLIC and Compact-Watershed, impact the generated visual explanations. We compare the resulting relevance areas with the image parts marked by a human reference. Results show that image parts selected as relevant strongly vary depending on the applied method. Quick-Shift resulted in the least and Compact-Watershed in the highest correspondence with the reference relevance areas.


1 Introduction

Especially in visual domains, deep Convolutional Neural Networks (CNNs) have shown their superior capabilities for object classification tasks such as semantic segmentation [7]. For CNNs, as well as for other deep learning architectures, crucial requirements for real-world applications are that (a) the learned classifiers make accurate predictions and (b) the systems' decision making is transparent and comprehensible to humans [15, 10]. Explanations of a system's decision making process can help machine learning experts to uncover unwanted biases. Additionally, for domain experts without a background in machine learning, explanations are crucial for being able to understand and trust the propositions of a classifier [10]. Applications in the medical or pharmaceutical fields particularly require the trust of the end user, since a physician will not trust the decision of a black box unless this decision is comprehensible.

In the context of image classification, many approaches for visual explanations have been proposed [22], such as LRP (Layer-wise Relevance Propagation, [2]) or LIME (Local Interpretable Model-Agnostic Explanations, [15]). For explaining image classifications, LIME relies on segmentation of the image into superpixels, that is, on similarity-based grouping of pixels into larger structures based on local features [4]. By default, LIME applies the superpixel algorithm Quick-Shift. The segmentation of an image into superpixels is crucial for the generation of the explanation in LIME, since perturbation of superpixels is used to identify which image areas were relevant for a specific class decision. Therefore, we were interested in exploring whether different superpixel approaches have a significant impact on the kind of visual explanations generated by LIME. Furthermore, where the superpixel approaches differ, it is also of interest how similar their results are to reference relevance assessments made by humans.

As an application domain we focus on biological data which come in a huge variety of image types – from fine grained microscopic images to holistic images of plants and animals. Our two case studies focused on applications from the medical and biological field, namely the detection of malaria parasites in thin blood smear images [13] and the detection of stress in tobacco plants used for pharmaceutical purposes [17].

In the following, we will first recapitulate the basic concepts of LIME. Afterwards, we will introduce a variety of superpixel approaches which are well known in computer vision. We then present the malaria domain and the evaluation results, showing how LIME's relevance explanations differ between the considered superpixel approaches and how similar they are to the human relevance selection. Additionally, we briefly present and discuss the tobacco domain. We conclude with a short discussion and further work to be done.

2 Visual explainability with LIME

LIME [15] is an explanation framework for the decisions of any machine learning classifier. In the original implementation it can process classifiers that take text, images or tabular data as input. In this work we focus on the explanation of decisions of classifiers that process image data. The output of LIME is therefore a set of connected pixel patches along with a weight for each patch. These weights indicate how strongly a patch is correlated with the classifier decision.

Given a classifier $f$ and an image instance $x$, LIME outputs the weights $w$ for all pixel patches of the image $x$. $w$ can be seen as the coefficients of a linear model $g$ that acts as a surrogate for the possibly complex decision boundary of $f$. This linear model should approximate the decision boundary in the locality of $x$. To achieve this, first a pool $Z$ of $N$ (a user-defined constant) perturbed versions of $x$ is generated. For images, this is achieved by randomly removing patches from the image and replacing them with the mean color of the patch or with some chosen color (default is grey). Every element of $Z$ is a triple $(z', f(z), \pi_x(z))$, where $z'$ encodes which patches are present in the perturbed version, $f(z)$ is the classification result of $f$ for the perturbed version $z$ in the image space, and $\pi_x(z)$ is a proximity measure that indicates how different the perturbed version is from the original instance. This measure is used to enforce locality for the linear model $g$. The weights $w$ are ultimately found through K-Lasso, a procedure based on the Lasso regression method [18]. Its input is the pool $Z$ and a user-defined feature limit $K$, the number of patches the user wants in the explanation.
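The procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the LIME implementation: the classifier `toy_classifier` is hypothetical, the proximity kernel (exponential in the fraction of removed patches) is an assumed form, and a small ridge penalty replaces K-Lasso to keep the example short.

```python
import math
import random

def perturb_pool(num_patches, pool_size, rng):
    """Draw binary patch masks: 1 keeps a patch, 0 removes (greys out) it."""
    return [[rng.randint(0, 1) for _ in range(num_patches)]
            for _ in range(pool_size)]

def proximity(mask, kernel_width=0.25):
    """Assumed kernel: exponential in the fraction of removed patches."""
    removed = mask.count(0) / len(mask)
    return math.exp(-(removed ** 2) / kernel_width ** 2)

def fit_surrogate(masks, preds, weights, ridge=1e-3):
    """Weighted ridge regression (stand-in for K-Lasso), solved by
    Gauss-Jordan elimination. Returns one coefficient per patch,
    followed by the intercept."""
    rows = [list(m) + [1.0] for m in masks]
    d = len(rows[0])
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
          + (ridge if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, preds))
         for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(d):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(d)]

def toy_classifier(mask):
    """Hypothetical black box: confident only if patch 2 is visible."""
    return 0.9 if mask[2] else 0.1

rng = random.Random(0)
masks = perturb_pool(num_patches=5, pool_size=200, rng=rng)
preds = [toy_classifier(m) for m in masks]
weights = [proximity(m) for m in masks]
coef = fit_surrogate(masks, preds, weights)
most_relevant = max(range(5), key=lambda i: coef[i])
```

In this toy setting the surrogate should assign by far the largest weight to patch 2, because the hypothetical classifier depends on that patch alone.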

3 Superpixel methods

Pixels, which are used to represent images in grid form, are not a natural representation of the depicted scene. Viewed in isolation, a single pixel reveals neither its origin in the original image nor its semantic meaning. Pixels are artefacts of the process of capturing and creating the digital image [14].

In comparison, the origin and semantic meaning of a superpixel can be determined. A superpixel is a local grouping or combination of pixels based on common properties, such as the color value. The advantages of superpixels can be summarized as follows [14]:

  • Lower complexity: Although the superpixel algorithm must be applied first to group the pixels, this process reduces the complexity of the image because far fewer entities remain. In addition, subsequent steps based on these superpixels require significantly less processing power.

  • Significant entities: Individual pixels are not very meaningful. However, the pixels in a superpixel share properties such as texture or color distribution. Through this grouping, superpixels become more expressive.

  • Marginal information loss: Superpixel approaches tend to oversegment. Thus, important areas are differentiated, but so are insignificant ones. However, this apparent disadvantage has the positive effect that only little information is lost.

3.1 Felzenszwalb

The algorithm of Felzenszwalb and Huttenlocher (FSZ) [3] is categorized as a graph-based approach and can be described as an edge-oriented method. The approach has a complexity of $O(n \log n)$, where $n$ is the number of graph edges.

First, the algorithm calculates a gradient between each pair of adjacent pixels. This gradient is weighted according to characteristic properties of the pixels, for example their color and brightness. Subsequently, one segment per pixel is formed as the seed for a future superpixel. The aim of the process is to keep the differences between the gradients within a segment as small as possible while making the differences between adjacent segments as large as possible. The resulting superpixels should be neither too small nor too large. However, the algorithm offers no direct control over the size and number of superpixels. This usually results in a very irregular size and shape distribution [3].
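As an illustration of the merging idea, the following sketch applies the merge criterion from [3] to a one-dimensional signal instead of an image. The function name `felzenszwalb_1d` and the example values are made up for this sketch. The scale parameter `k` comes from [3] and only indirectly biases the result towards larger segments, which is in line with the lack of direct size control noted above.

```python
def felzenszwalb_1d(values, k):
    """Segment a 1D signal with the graph-based merge rule of [3].

    Every sample starts as its own segment. Edges between neighbours are
    processed in order of increasing weight, and two segments are merged
    when the edge weight does not exceed the internal difference of either
    segment plus the size-dependent tolerance k / |segment|.
    """
    n = len(values)
    parent = list(range(n))
    size = [1] * n
    internal = [0.0] * n  # largest edge weight inside each segment

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((abs(values[i] - values[i + 1]), i, i + 1)
                   for i in range(n - 1))
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        threshold = min(internal[ra] + k / size[ra],
                        internal[rb] + k / size[rb])
        if weight <= threshold:
            parent[rb] = ra
            size[ra] += size[rb]
            # Edges arrive sorted, so this edge is the largest one inside.
            internal[ra] = weight
    return [find(i) for i in range(n)]

signal = [10, 11, 10, 12, 50, 51, 49, 50]
segments = felzenszwalb_1d(signal, k=5)
```

For this signal the large jump between 12 and 50 is never merged, so two segments remain.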

3.2 Quick-Shift

Quick-Shift (QS) is the algorithm LIME uses by default; it is described in detail in [20]. It uses a so-called mode-seeking segmentation scheme to generate superpixels. This approach moves each point to the nearest point with a higher density, which causes an increase in the density. QS offers no way of controlling either the number or the size of the superpixels.
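The mode-seeking step can be illustrated on one-dimensional points. The sketch below assumes a Gaussian Parzen window for the density estimate; the names `quick_shift_1d`, `sigma` and `max_dist` are chosen for this example only. Each point is linked to its nearest neighbour with a higher density, and points that end up at the same root form one cluster.

```python
import math

def quick_shift_1d(points, sigma=1.0, max_dist=3.0):
    """Link every point to its nearest neighbour of higher density."""
    n = len(points)
    # Parzen density estimate with a Gaussian window (assumed form).
    density = [sum(math.exp(-((p - q) ** 2) / (2 * sigma ** 2))
                   for q in points) for p in points]
    parent = list(range(n))
    for i in range(n):
        best = None
        for j in range(n):
            dist = abs(points[i] - points[j])
            if density[j] > density[i] and dist <= max_dist:
                if best is None or dist < best:
                    best, parent[i] = dist, j

    def root(i):
        # Parents always have strictly higher density, so this terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]

clusters = quick_shift_1d([0.0, 0.5, 1.0, 10.0, 10.5, 11.0])
```

Here the two groups of points collapse onto their respective density modes (the points at 0.5 and 10.5). The number of clusters follows from the data and the window size rather than from a requested superpixel count.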

3.3 SLIC

As the name Simple Linear Iterative Clustering (SLIC) [1] suggests, this superpixel algorithm belongs to the group of cluster-based algorithms. SLIC uses the well-known K-Means algorithm [8] as a basis, but there are essential differences:

  • The search space ($2S \times 2S$) is limited in proportion to the size of the superpixel ($S \times S$). This significantly reduces the number of distance calculations.

  • In addition, the complexity is independent of the number of superpixels $k$; SLIC has a complexity of $O(N)$, where $N$ is the number of pixels.

  • Furthermore, a weighted distance measure $D$ (see equation 1) combines the spatial ($d_s$) and color ($d_c$) proximity.

  • In addition, the compactness and size of the superpixels are controlled by a parameter ($m$).

$$D = \sqrt{d_c^2 + \left(\frac{d_s}{S}\right)^2 m^2} \qquad (1)$$

The parameter $k$ defines the desired number of superpixels. The clustering process starts with the initialization of $k$ cluster centers (mathematically: $C_i = [l_i, a_i, b_i, x_i, y_i]^T$), which are sampled on a regular grid with a spacing of $S$ pixels. Choosing $S = \sqrt{N/k}$ yields superpixels of approximately equal size. Next, the centers are shifted to the position of the smallest gradient within a $3 \times 3$ neighborhood. This is done, among other things, to avoid placing a cluster center on an edge.

Then each pixel is assigned to the nearest cluster center whose search area ($2S \times 2S$) overlaps the pixel's position. The nearest cluster center is determined by the distance measure $D$ (see equation 1).

Then, in an update step, each cluster center is set to the mean vector of the pixels assigned to it. Finally, a residual error $E$ between the new and the old cluster centers is determined. The assignment and update steps are repeated until the residual error falls below a threshold value. Finally, all unconnected pixels are added to a nearby superpixel.
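The distance measure and the grid spacing can be written down directly. The sketch below assumes the formulation from the SLIC paper [1], namely $D = \sqrt{d_c^2 + (d_s/S)^2 m^2}$ with grid interval $S = \sqrt{N/k}$; the function names and the tuple layout `(l, a, b, x, y)` are chosen for this example.

```python
import math

def grid_interval(num_pixels, k):
    """Spacing S of the initial cluster centers for k superpixels."""
    return math.sqrt(num_pixels / k)

def slic_distance(pixel, center, S, m):
    """Combined color and spatial distance between a pixel and a center.

    pixel and center are (l, a, b, x, y) tuples: CIELAB color followed by
    image coordinates. m weights spatial against color proximity.
    """
    d_c = math.dist(pixel[:3], center[:3])   # color distance
    d_s = math.dist(pixel[3:], center[3:])   # spatial distance
    return math.sqrt(d_c ** 2 + (d_s / S) ** 2 * m ** 2)

S = grid_interval(num_pixels=100 * 100, k=100)
same_color = slic_distance((50, 0, 0, 3, 4), (50, 0, 0, 0, 0), S, m=10)
same_place = slic_distance((53, 0, 0, 0, 0), (50, 0, 0, 0, 0), S, m=10)
```

A larger `m` increases the influence of the spatial term and therefore produces more compact superpixels, while a smaller `m` lets the superpixels follow color boundaries more closely.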

3.4 Compact-Watershed

The Compact-Watershed (CW) [11] algorithm is an optimized, more compact version of the Watershed superpixel algorithm [9]. A gradient image is used as input. Because the grey tone of each pixel is interpreted as an altitude, the input can be seen as a topographical surface. This surface is then flooded continuously, resulting in watersheds with catchment basins. During this process over-segmentation may occur. To prevent it, so-called markers are used [9]:

  1. A set of markers (each with a different label) is chosen; the flooding starts from these markers.

  2. A priority queue is created that collects the neighboring pixels of each marked area. Each pixel is assigned a priority level that corresponds to its gradient magnitude.

  3. The pixel with the highest priority (the lowest gradient magnitude) is pulled out of the priority queue. If all of its neighbors that are already labeled carry the same label, the pixel receives that label. Its neighbors that are not yet labeled and not yet contained in the priority queue are pushed into the queue.

  4. The previous step (3) is repeated until the priority queue is empty.

The pixels that are still unlabeled once the priority queue is empty form the watershed lines.
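The four steps above can be sketched with a binary heap from the Python standard library. The sketch assumes 4-connected neighbours and that the highest priority corresponds to the lowest gradient magnitude, so that flooding proceeds from low to high altitude. The function name `marker_watershed` and the tiny one-row example are illustrative only.

```python
import heapq

def marker_watershed(gradient, markers):
    """Marker-based flooding on a 2D grid (lists of lists).

    gradient holds the altitude of each pixel, markers holds the initial
    labels with 0 meaning unlabeled. Pixels that remain 0 in the result
    form the watershed lines.
    """
    rows, cols = len(gradient), len(gradient[0])
    labels = [row[:] for row in markers]
    queued = [[False] * cols for _ in range(rows)]
    heap = []

    def neighbours(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc

    def enqueue_neighbours(r, c):
        for rr, cc in neighbours(r, c):
            if labels[rr][cc] == 0 and not queued[rr][cc]:
                queued[rr][cc] = True
                heapq.heappush(heap, (gradient[rr][cc], rr, cc))

    # Steps 1 and 2: start from the markers and queue their neighbours.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != 0:
                enqueue_neighbours(r, c)

    # Steps 3 and 4: flood until the queue is empty.
    while heap:
        _, r, c = heapq.heappop(heap)
        seen = {labels[rr][cc] for rr, cc in neighbours(r, c)} - {0}
        if len(seen) == 1:
            labels[r][c] = seen.pop()
        # Pixels touching two different basins keep label 0.
        enqueue_neighbours(r, c)
    return labels

gradient = [[0, 1, 2, 9, 2, 1, 0]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
flooded = marker_watershed(gradient, markers)
```

In the example, the two basins grow towards each other from both ends, and the ridge pixel with gradient 9 stays unlabeled as the watershed line.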

Compact-Watershed is derived from the original Watershed algorithm and results in superpixels that are more compact in terms of size and extension. This is achieved by using a weighted combination of the Euclidean distance of a pixel from the superpixel's seed point and the difference between the pixel's grey value and the seed pixel's grey value.

Figure 1: Superpixel approaches in comparison: (a) Felzenszwalb, (b) Quick-Shift, (c) SLIC, (d) Compact-Watershed
Source: Original photo by Baptist Standaert on Unsplash

4 Case Studies

4.1 Malaria

Malaria is a parasitic infectious disease. It is predominantly transmitted by anopheles mosquitoes, but can also be transmitted from person to person, for example by blood transfusion, organ transplantation or by sharing injection needles [12, 21]. Malaria killed 435,000 people in 2017. Of these, 266,000 were children under 5 years of age [21].

A network that is trained to detect malaria in cells and whose results are made comprehensible by LIME is therefore of great benefit for malaria diagnosis. For this purpose a ResNet50 [6] was trained; the results are shown in Table 1.

The malaria data set [13] consists of blood smear images of the most used diagnostic tool, Rapid Diagnostic Tests (RDT) [21]. The data are divided into two classes: cells labeled as malaria positive and cells labeled as malaria negative. In particular, the relatively large number of training examples (26,758 in total) and their equal distribution over the classes (50%-50%) promise a good basis for a meaningful network that assesses whether a cell is infected with malaria or not.

Metric                               Value
Training accuracy                    97.8182%
Training loss (cross entropy)        0.0573
Validation accuracy                  96.5167%
Validation loss (cross entropy)      0.0970
Test accuracy                        96.3715%
Test loss (cross entropy)            0.1069
Table 1: Model results for the Malaria model

Figure 2: Examples of the malaria dataset: (a) non-infected cell, (b) infected cell

4.1.1 Experiments

To enable an objective comparison of the superpixel approaches, the Jaccard coefficient [5], which indicates the similarity of two sets, was calculated. The similarity is determined between the results of the different superpixel approaches and the respective average relevant area of the decision. The relevant area of each image was selected manually by marking the indicator that is most relevant to the decision.
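For two sets $A$ and $B$ the Jaccard coefficient is $|A \cap B| / |A \cup B|$. A minimal sketch for binary masks follows, where one mask would hold the superpixel selected by LIME and the other the human reference area. The masks below are made-up toy data, and treating two empty masks as identical is a convention chosen for this example.

```python
def jaccard(mask_a, mask_b):
    """Jaccard coefficient of two binary masks given as lists of rows."""
    set_a = {(r, c) for r, row in enumerate(mask_a)
             for c, value in enumerate(row) if value}
    set_b = {(r, c) for r, row in enumerate(mask_b)
             for c, value in enumerate(row) if value}
    union = set_a | set_b
    if not union:
        return 1.0  # convention: two empty masks count as identical
    return len(set_a & set_b) / len(union)

explanation = [[0, 1, 1],
               [0, 1, 1],
               [0, 0, 0]]
reference = [[0, 0, 1],
             [0, 1, 1],
             [0, 0, 1]]
similarity = jaccard(explanation, reference)
```

The two toy masks share three pixels and cover five pixels in total, so their similarity is 3/5.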

Figure 3: Original blood smear image from the malaria data set, as well as the corresponding average selection: (a) Original, (b) Average selection

To make the results comparable, 100 images of infected cells were selected from the test set of the malaria blood smear images. To simplify the manual selection, only images with a single malaria indicator were chosen. Of these 100 images, 85 were classified by the network as infected (true positive). The remaining 15 were classified as not infected (false negative). Table 2 shows the Jaccard coefficient of each superpixel approach for the explanations of the true positive classifications, where only the most important feature for the network's decision (see Figure 4) is displayed. All superpixel methods were tuned for the given case to maximize the average Jaccard coefficient, hence the additional optimized Quick-Shift version. This was done so that all superpixel approaches would be compared on a fair footing.

Superpixel method        Mean value    Variance      Standard deviation
Felzenszwalb             0.85603243    0.03330687    0.18250170
Quick-Shift              0.52272303    0.04613085    0.21478094
Quick-Shift optimized    0.88820585    0.00307818    0.05548137
SLIC                     0.96437629    0.00014387    0.01199452
Compact-Watershed        0.97850773    0.00003847    0.00620228
Table 2: Jaccard coefficient of the different superpixel methods

Figure 4: LIME results for true positive predicted malaria infected cells: (a) Original, (b) Felzenszwalb, (c) Quick-Shift, (d) Quick-Shift opt., (e) SLIC, (f) Compact-Watershed

Figure 5: LIME results for false negative predicted malaria infected cells: (a) Original, (b) Felzenszwalb, (c) Quick-Shift, (d) Quick-Shift opt., (e) SLIC, (f) Compact-Watershed

4.2 Tobacco plants

Tobacco is an important plant in biopharmaceutical production with genetically modified (GM) plants. Two important reasons are its ability to produce biomass quickly and the minimal risk of food chain contamination, since tobacco is not a food crop [19]. It is able to produce proteins which can be used for the treatment or diagnosis of various diseases. However, if plants are used to produce medicine for human use, the strict regulations of pharmaceutical production must be observed. In this context it is desirable to monitor the health state of each plant to ensure that only healthy plants are used for drug production. Note, however, that different parts of the world regulate pharmaceutical production with GM plants differently [16].

Figure 6: Examples of the Tobacco plant dataset: (a) Healthy Tobacco plant, (b) Stressed Tobacco plant

Stocker et al. have investigated various methods to classify stress in tobacco plants using non-neural AI approaches [17]. Figure 6 shows sample images of healthy and stressed tobacco plants. We use the same tobacco data set as a case study to assess the suitability of CNNs for stress classification, again using LIME to provide insights into the classification process. For training, the same ResNet50 as for the malaria data set was used. The only difference was that, to prevent overfitting, the last layers were unfrozen during training on the tobacco training set. The tobacco data set consists of 700 images in total, divided into two classes, healthy and stressed. Only 81 images of stressed plants are contained in the data set, so expectations of a good classification result were limited.

Table 3 shows the results of the trained model on the tobacco plant data set. With a training accuracy of 91.26% but a test accuracy of only 50%, the model evidently does not generalize, so its results should not be trusted. We therefore decided to discontinue work on this case study for the time being.

Metric               Value
Training accuracy    91.2577%
Training loss        0.4459
Test accuracy        50%
Test loss            0.7524
Table 3: Model results for Tobacco plants

Hyperparameter       Value
Epochs               50
Batch size           32
SGD learning rate    0.0001
SGD momentum         0.90
SGD Nesterov         Yes
Dropout              0.50
L2 regularization    0.0001
Table 4: Hyperparameters for the training of both models

5 Discussion

In the following, the experiments with LIME from the previous section, their results (see Table 2) and possible improvements of the visual explainability are discussed. Although FSZ is an older algorithm than the other approaches considered, it still achieved good results. Surprisingly, it surpasses QS, the standard algorithm of LIME, by about 33.33 percentage points in the mean Jaccard coefficient. Since FSZ has no parameter that could limit the size of a superpixel, it seems that LIME can act largely freely and generate the superpixels purely based on their relevance for the explanation. A good example of such a case is the explanation from LIME for the false negative classification shown in Figure 5. In contrast to the other superpixel methods explored, FSZ makes the decision for not infected more comprehensible. However, the variance and standard deviation for the true positive examples indicate that the similarity varies considerably: with FSZ the results are not stable and may sometimes show regions as relevant for the decision that are actually not important. This is, for example, the case for the first result from LIME using FSZ (see Figure 4).

The optimized version of QS achieved a remarkable improvement of 36.55 percentage points over the standard version used by LIME. Additionally, it performs slightly better than FSZ, with an improvement of 3.22 percentage points, and its variance and standard deviation are also lower, which indicates that the results are more stable than with FSZ and the unoptimized QS version.

SLIC makes it possible to influence the actual size of the superpixels through a parameter. Consequently, its higher similarity, 44.17 percentage points above QS and 7.62 percentage points above the optimized version of QS, is not surprising. Additionally, a lower variance and standard deviation were achieved. These results show that SLIC has advantages over QS because its superpixels correspond better to the relevant areas.

The last superpixel approach compared with QS was CW. Like SLIC, it supports influencing the compactness of the resulting superpixels. Of all superpixel approaches, CW yielded the best results. It achieved an improvement of 45.58 percentage points over the standard QS and of 9.03 percentage points over the optimized QS version. It also significantly reduces the variance and standard deviation. This indicates a very consistent correspondence across all 85 images.
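The improvements quoted in this section can be reproduced from the mean values in Table 2. The short sketch below, with a helper name chosen for this example, shows that they are absolute differences of the mean Jaccard coefficient multiplied by 100, not relative changes.

```python
# Mean Jaccard coefficients copied from Table 2.
mean_jaccard = {
    "Felzenszwalb": 0.85603243,
    "Quick-Shift": 0.52272303,
    "Quick-Shift optimized": 0.88820585,
    "SLIC": 0.96437629,
    "Compact-Watershed": 0.97850773,
}

def gain_in_points(method, baseline):
    """Absolute difference of the mean Jaccard coefficients, times 100."""
    return round(100 * (mean_jaccard[method] - mean_jaccard[baseline]), 2)

gains_over_default = {
    method: gain_in_points(method, "Quick-Shift")
    for method in mean_jaccard if method != "Quick-Shift"
}
gains_over_optimized = {
    method: gain_in_points(method, "Quick-Shift optimized")
    for method in ("SLIC", "Compact-Watershed")
}
```

Expressed relative to the default Quick-Shift mean of about 0.52, the same gaps would correspond to considerably larger relative changes, so the two readings should not be mixed.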

6 Conclusion

Our results suggest that tailoring the superpixel approach to the task, whether by an optimized version of QS or by FSZ, SLIC or CW, will improve the visual explainability of LIME. Selecting a suitable algorithm for LIME can therefore be beneficial and should be considered. With the exception of QS, the approaches segment fewer irrelevant areas of an image (see Figure 4). It was also observed that CW achieved the best results.

In applications where large and uneven features are to be emphasized, an approach like CW would possibly do worse because it divides the input into very small, evenly sized superpixels. FSZ, which generates superpixels of significantly different sizes, may even achieve the best results in such application areas. Consequently, the finding that CW gives the best results for malaria is not universally valid, and the superpixel approaches should be evaluated by experts in different application areas. Another conclusion is that, in our case study, superpixel methods other than the default QS were more suitable for LIME.

Since pharmaceutical and agricultural applications are an emerging research area for applying machine learning to digital plant phenotyping tasks, we plan to continue pursuing the ideas begun in the tobacco case study. We suspect that an objective assessment of plant health will yield better results if based on 3D data, because the habitus of a plant should then be represented more realistically than in a purely texture-based 2D analysis as in the tobacco case study. Furthermore, the number of training images in that case study was insufficient, so the goal will be to generate a larger data set containing 3D scans of plants to continue research on this subject.

References

  • [1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2274–2282 (2012)
  • [2] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10(7), e0130140 (2015)
  • [3] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International journal of computer vision 59(2), 167–181 (2004)
  • [4] Fulkerson, B., Vedaldi, A., Soatto, S.: Class segmentation and object localization with superpixel neighborhoods. In: 2009 IEEE 12th international conference on computer vision. pp. 670–677. IEEE (2009)
  • [5] Jaccard, P.: Lois de distribution florale dans la zone alpine. Bull Soc Vaudoise Sci Nat 38, 69–130 (1902)
  • [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
  • [7] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

  • [8] Lloyd, S.: Least squares quantization in pcm. IEEE transactions on information theory 28(2), 129–137 (1982)
  • [9] Meyer, F.: Color image segmentation. In: Image Processing and its Applications, 1992., International Conference on. pp. 303–306 (1992)
  • [10] Muggleton, S.H., Schmid, U., Zeller, C., Tamaddoni-Nezhad, A., Besold, T.: Ultra-strong machine learning: comprehensibility of programs learned with ilp. Machine Learning 107(7), 1119–1140 (2018)
  • [11] Neubert, P., Protzel, P.: Compact watershed and preemptive slic: On improving trade-offs of superpixel segmentation algorithms. In: Pattern Recognition (ICPR), 2014 22nd International Conference on. pp. 996–1001 (2014)

  • [12] Nocht, B., Mayer, M.: Die Malaria: Eine Einführung in Ihre Klinik, Parasitologie und Bekämpfung. Springer Berlin Heidelberg, Berlin, Heidelberg, 2nd extended edn. (1936). https://doi.org/10.1007/978-3-642-91256-6
  • [13] Rajaraman, S., Antani, S.K., Poostchi, M., Silamut, K., Hossain, M.A., Maude, R.J., Jaeger, S., Thoma, G.R.: Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images. PeerJ 6, e4568 (2018)
  • [14] Ren, X., Malik, J.: Learning a classification model for segmentation (2003)
  • [15] Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1135–1144. ACM (2016)
  • [16] Spök, A., Twyman, R.M., Fischer, R., Ma, J.K., Sparrow, P.A.: Evolution of a regulatory framework for pharmaceuticals derived from genetically modified plants. Trends in Biotechnology 26(9), 506–517 (2008)
  • [17] Stocker, C., Uhrmann, F., Scholz, O., Siebers, M., Schmid, U.: A machine learning approach to drought stress level classification of tobacco plants. In: LWA. pp. 163–167 (2013)
  • [18] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) pp. 267–288 (1996)
  • [19] Tremblay, R., Wang, D., Jevnikar, A.M., Ma, S.: Tobacco, a highly efficient green bioreactor for production of therapeutic proteins. Biotechnology Advances 28(2), 214–221 (2010)
  • [20] Vedaldi, A., Soatto, S.: Quick shift and kernel methods for mode seeking. In: European Conference on Computer Vision. pp. 705–718 (2008)
  • [21] World Health Organization, et al.: World malaria report 2018 (2018)
  • [22] Zhang, Q.s., Zhu, S.C.: Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering 19(1), 27–39 (2018)