Over the past decade, deep neural networks (DNNs) have achieved breakthroughs in various AI tasks and have dramatically changed many fields, such as computer vision and Natural Language Processing (NLP). However, the lack of transparency in DNN models has raised serious concerns about the widespread deployment of ML/DL technologies, especially when these black-box models are endowed with a decision-making role in life-critical applications, such as medical diagnosis, autopilot systems, intelligent surveillance, and space engineering.
Numerous methods for model explanation have been proposed, but their results are often unsatisfactory. Some existing methods [32, 22, 2] generate importance maps based on intermediate information. For example, CAM (Class Activation Mapping) requires a specific model structure with a global average pooling layer; Grad-CAM needs the weights and feature-map values for a weighted summation. Besides being inconvenient to apply, the results are also low-resolution and visually coarse due to the up-scaling operations in the heat-map generation process. As shown in Figure 1, Grad-CAM can locate objects such as bananas and skiers, but it covers too large an area, going beyond the scope of the object region and thereby reducing its reliability and validity. Black-box model methods [19, 7, 14, 26] avoid this inconvenience, but problems of accuracy and granularity remain. LIME solves the model-agnostic problem in the spirit of traditional machine learning, by fitting a small local linear model to explain the prediction of a much more complicated model. However, a linear model cannot bear the weight of millions of samples, which leads to over-fitting. Therefore, LIME simplifies the input sampling method (from pixel-wise to super-pixel-wise), which results in coarse-grained, low-accuracy maps. For samples such as the banana and wall clock in Figure 1, LIME roughly covers the region; however, it covers only part of the target's patches and also covers some patches in wrong locations, because the probability threshold is non-adaptive and requires manual input from the user. BBMP iteratively optimizes the saliency map with the Adam optimizer, but when the background is complex and deceptive, it is difficult to converge and correctly locate the target, producing poor explanations for the first three samples. RISE can generate a pixel-wise heat-map, but its grid-based sampling method performs poorly when targets are non-rectangular or when multiple tiny objects of the same category are present; it is only applicable to rectangular-like objects, such as the goldfish sample in the fourth row. A very recent work, Extremal Perturbation (EP), identifies an optimal occluding mask that has the maximum effect on the output of a CNN model. It is the state-of-the-art (SOTA) method, and we compare against it in our experiments.
The main contributions of this paper are summarized as follows:
(1) We invent the Morphological Fragmental Perturbation Pyramid (MFPP), a new perturbation technique for black-box model explanation, which perturbs morphological fragments of different scales and makes full use of input semantic information;
(2) We apply MFPP to the randomized input sampling method and significantly improve its explanation accuracy;
(3) We perform qualitative and quantitative evaluations on multiple datasets and models, showing that MFPP explains predictions from black-box DNNs with better accuracy while being an order of magnitude faster than the state-of-the-art method.
The rest of the paper is organized as follows: Section II briefly reviews related work in the model explanation area; Section III proposes our black-box model explanation method, MFPP; Section IV presents intuitive and quantitative experiments with corresponding result analysis; finally, we summarize our work in Section V.
II Related Work
In the past several years, numerous methods [32, 22, 2, 19, 14, 26] have been proposed to explain and visualize the predictions of deep CNN classifiers, which has significantly advanced the study of model interpretability and model design optimization. Several survey works have given comprehensive summaries of these methods. In this section, we provide several criteria to delimit previous approaches to visual explanation along multiple dimensions. This allows researchers to change perspectives and thoroughly understand the differences between the categories.
II-A Model-Dependent vs. Model-Agnostic
Since the ultimate goal of model interpretation is to diagnose and analyze how a model works, the model itself is the protagonist. Whether the model is treated as a black box or a white box divides methods into two main camps: model-dependent methods (MDM) and model-agnostic methods (MAM). For MDM, internal model values must be at least partially available, because these methods use a linear combination of activations from the final convolutional layers to generate an explanation. CAM generates a low-resolution localization map of the important regions in the image by combining the feature maps preceding the final Global Average Pooling (GAP) layer of a CNN model with class-specific weights. Grad-CAM requires the values of activations and a specified feature map. Grad-CAM++ even requires a smooth prediction function, since it utilizes third-order derivatives. In short, model-dependent methods often have strict limitations on their use.
In contrast, model-agnostic methods treat the model as a pure black box. Generally, given input samples and the corresponding outputs, a model's prediction should be visually explainable. The classifiers are not limited to CNN models, and the application can easily be extended to other models, such as Support Vector Machines (SVM), decision trees, and random forests. Model-agnostic methods therefore have a wide range of practical uses.
II-B Patch-Wise vs. Pixel-Wise Perturbation
Input perturbation is a universal method in which we measure the variation of the output caused by changing the input, e.g., by removing or inserting information in the image through masking, blurring, or replacement. It is also applied to generate large volumes of prediction data for random sampling or local model fitting. BBMP discusses which perturbation method is most meaningful by comparing three approaches: regional blurring, constant-value replacement, and noise addition. The perturbed regions can be selected as pixels or patches, which leads to different results.
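The three perturbation styles compared by BBMP can be sketched as follows. This is an illustrative sketch, not BBMP's exact implementation; the function name and parameters are our own, and a crude box blur stands in for the Gaussian blur used in practice.

```python
import numpy as np

def perturb(img, mask, mode="constant", seed=0):
    """Replace pixels where mask == 1 with a blurred copy,
    a constant value, or random noise (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = mask[..., None].astype(float)               # HxW -> HxWx1 for broadcasting
    if mode == "blur":
        # crude box blur as a stand-in for regional Gaussian blurring
        shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
        ref = sum(np.roll(img, s, axis=(0, 1)) for s in shifts) / len(shifts)
    elif mode == "constant":
        ref = np.full_like(img, img.mean())         # constant-value replacement
    elif mode == "noise":
        ref = rng.random(img.shape)                 # noise perturbation
    else:
        raise ValueError(mode)
    return img * (1 - m) + ref * m
```

Whichever style is chosen, the unmasked region must stay untouched so that output variation can be attributed to the perturbed region alone.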
Pixel-wise perturbation is adopted by FGVis and Real-Time Saliency, while LIME, Anchors, and the Regional method are patch-wise. This affects the granularity of the output heat-map. In general, a pixel-wise heat-map is more accurate in location but short on detail, because its results are spatially discrete. A patch-wise result is more visually pleasing and closer to a model-dependent result, since its boundaries can better fit the object boundaries.
II-C External Model Fitting vs. Statistical Methods
How the model outputs of perturbed inputs are processed further divides methods into two camps, and also makes the results and processing times very different.
LIME locally fits another linear model with ridge regression to facilitate interpretation. The Anchors method extends this idea by locally fitting a decision tree for better interpretability. However, both manually set the number of iterations for local model fitting, which cannot guarantee a good fit. BBMP also iteratively optimizes the heat-map with a gradient-based optimization method (Adam). Our experimental results show that the number of iterations visibly affects the accuracy and appearance of the final importance map, yet the authors set it to a constant value based on experience. In contrast, RISE uses a statistical method: it computes a weighted sum of all sampled outputs without extra model fitting or optimizer iterations. Table II shows its advantage in average speed.
III Proposed Method
Input perturbation is a generic technique used by existing approaches [LIME, Anchors, BBMP, RISE, FGVis, EP]. In RISE, Petsiuk et al. propose a randomized input sampling method: they randomly occlude half of the cells of a 7×7 grid and additionally apply a random translation shift to generate masks. Ignoring the morphological characteristics of the object makes the visual explanation unsatisfactory, as shown in Figure 1. In LIME, the coarse-grained super-pixel perturbation makes the explanation coarse-grained and inaccurate, as shown in Figure 1. In Sec. III-A, we show how morphology matters to the visual explanation result and analyze the relationship; in Sec. III-B, we introduce a novel input perturbation method that bridges multi-scale morphological information with the input perturbation approach and greatly improves the granularity of the explanation result; Sec. III-C shows the overall flow chart of our proposed method.
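The RISE-style grid sampling described above can be sketched as follows. This is an approximation for illustration: the function name and parameters are ours, and a nearest-neighbor upsample replaces the bilinear interpolation used in the original RISE.

```python
import numpy as np

def rise_masks(n_masks, grid=7, size=224, p_keep=0.5, seed=0):
    """Random grid masks: occlude cells of a grid x grid binary
    pattern, upsample, then apply a random translation shift."""
    rng = np.random.default_rng(seed)
    cell = int(np.ceil(size / grid))
    up = cell * (grid + 1)                  # one cell larger, to allow shifting
    masks = np.empty((n_masks, size, size), dtype=np.float32)
    for i in range(n_masks):
        g = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        # nearest-neighbor upsample of the grid pattern
        big = np.kron(g, np.ones((cell, cell), dtype=np.float32))
        big = np.pad(big, ((0, up - big.shape[0]), (0, up - big.shape[1])))
        dy, dx = rng.integers(0, cell, size=2)      # random translation shift
        masks[i] = big[dy:dy + size, dx:dx + size]
    return masks
```

Because each mask is built from a rigid grid, it cannot align with object boundaries, which is exactly the limitation MFPP addresses with morphological fragments.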
III-A Morphological Analysis on Visual Explanation
An object together with its surrounding region is the major foundation for a classification model's decision-making, so what constitutes an object matters to the visual explanation of each prediction. Based on the theory of morphology, an object consists of shapes, textures, and colors. Zeiler et al. visualize and interpret convolutional networks with DeconvNet; the visualization of each convolutional layer's outputs shows that low-level layers focus on edges, corners, and colors, middle layers on textures and partial shapes, and high-level layers on the whole object, preferring to extract the semantic information of the complete shape. Motivated by this work, a question arises: have we fully utilized the morphological information when performing input perturbation?
As Figure 3 shows, the image is first segmented into fragments, also called super-pixels. We then perturb the super-pixels and send the perturbed images to the model for prediction. Third, we statistically measure the scores of the 'bird' class in each prediction. Finally, we lower the threshold of the explanation score to visualize the probability distribution over the fragments constituting the target object.
Based on this method, we can evaluate the contribution of each patch to a specific class. For the 'bird' class prediction, as shown in the top-right figure of Figure 3, we mark fragments with different colors; the brighter the color, the higher the score. The highest-scoring patches cluster in the region where the bird is located, and the score decreases from the inside outward, as shown at the bottom of Figure 3.
This phenomenon inspires us to generate morphology-based masks that better perturb the model input according to its semantic distribution. As segmentation is the foundation of mask generation in our method, we adopt SLIC, a fast and effective super-pixel algorithm with linear time complexity. In Figure 4, we generate different segmentation results with various sigma values in SLIC, which control the smoothness of fragment boundaries, to observe their impact on the final result. Charts (a)–(d) in Figure 6 show the relationship for all factors. We choose the sigma value that better fits the edges in the image, since edges are likely boundaries between foreground and background or between instances.
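To make the super-pixel step concrete, here is a simplified SLIC-style segmenter: k-means over joint (color, position) features initialized on a regular grid. This is only a sketch of the idea, not the full SLIC algorithm (it omits SLIC's restricted search window and connectivity enforcement), and the `compactness` weighting convention here is our own.

```python
import numpy as np

def slic_like(img, n_segments=16, compactness=0.2, n_iter=5):
    """Simplified SLIC-style superpixels. img: HxWx3 float array in [0, 1].
    compactness scales the spatial term: larger -> more grid-like fragments."""
    h, w, _ = img.shape
    s = int(np.sqrt(h * w / n_segments))            # approximate grid interval
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [img, (ys[..., None] / s) * compactness, (xs[..., None] / s) * compactness],
        axis=-1,
    ).reshape(-1, 5)
    # initialize cluster centers on a regular grid of seed pixels
    cy = np.clip(np.arange(s // 2, h, s), 0, h - 1)
    cx = np.clip(np.arange(s // 2, w, s), 0, w - 1)
    centers = feats.reshape(h, w, 5)[np.ix_(cy, cx)].reshape(-1, 5)
    for _ in range(n_iter):
        # assign each pixel to the nearest center in (color, space) feature space
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(len(centers)):               # recompute centers
            pts = feats[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return labels.reshape(h, w)
```

In practice a library implementation (e.g. scikit-image's `slic`) would be used; the point here is that fragments follow color edges rather than a fixed grid.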
III-B Fragment Perturbation Pyramid
Lin et al. propose the Feature Pyramid Network (FPN) to exploit the inherent multi-scale pyramidal hierarchy of deep convolutional networks, which shows significant improvement in several applications. In the object detection domain, from sliding windows to multiple feature maps as input to the classification module, researchers have long sought ways to let predictors see objects at vastly different scales. In Faster R-CNN, anchors are placed on several feature-map layers to extend the receptive field to different scales, which also significantly improves accuracy on objects of different sizes. Besides, the one-stage YOLO detection series [15, 16, 17] continues to evolve by adding more feature maps of different scales into the evaluation phase. Following this principle, we transfer this classic idea from detection to explanation.
As Figure 5 shows, the input image is fed into the segmentation algorithm and segmented into fragments of different granularity by changing the fragment-number parameter of the segmentation algorithm. These fragments are the templates used for randomized mask generation.
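Building the pyramid of fragment templates can be sketched as below. For a self-contained illustration, a trivial regular-grid segmenter stands in for the super-pixel algorithm; in MFPP each label map would instead come from SLIC with a different fragment count.

```python
import numpy as np

def grid_segment(h, w, n_side):
    """Placeholder segmenter: a regular n_side x n_side grid of labels.
    MFPP would use a super-pixel algorithm such as SLIC instead."""
    ys = np.minimum(np.arange(h) * n_side // h, n_side - 1)
    xs = np.minimum(np.arange(w) * n_side // w, n_side - 1)
    return ys[:, None] * n_side + xs[None, :]

def fragment_pyramid(h, w, sides=(2, 4, 8)):
    """One label map per pyramid scale; each map is the template pool
    for randomized mask generation at that granularity."""
    return [grid_segment(h, w, n) for n in sides]
```

Coarse scales capture whole objects while fine scales capture parts, so masks drawn from the full pyramid can explain both large and tiny targets.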
This enables the model interpreter to view objects at multiple scales in the region, as shown in Figure 5, which greatly helps enrich the explanation sources.
Figure 2 shows the overall structure of MFPP and the full data flow for explaining a black-box model's prediction.
The input image is sent to the segmentation algorithm to generate multiple segmented fragments. Masks are then generated by randomly setting these fragments to zero grayscale. These masks are element-wise multiplied with the input to obtain masked images, which are fed to the black-box model to get the prediction score of the target class. Each score is used as the weight in a weighted sum of all masks, yielding the final importance map.
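The flow above can be sketched end-to-end. The interface is assumed for illustration: `label_maps` is a list of fragment label maps (one per pyramid scale) and `model` is any black-box callable returning the scalar score of the target class.

```python
import numpy as np

def mfpp_saliency(img, label_maps, model, n_masks=100, p_keep=0.5, seed=0):
    """MFPP-style saliency sketch: randomly keep fragments at each scale
    to build binary masks, score each masked image with the black-box
    model, and return the score-weighted sum of masks, normalized by
    the expected coverage (as in RISE's Monte Carlo estimate)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    sal = np.zeros((h, w), dtype=np.float64)
    total = 0
    for labels in label_maps:                       # one label map per scale
        n_frag = int(labels.max()) + 1
        for _ in range(n_masks // len(label_maps)):
            keep = rng.random(n_frag) < p_keep      # randomly keep fragments
            mask = keep[labels].astype(np.float64)  # fragment-wise binary mask
            score = model(img * mask[..., None])    # black-box class score
            sal += score * mask
            total += 1
    return sal / (total * p_keep)
```

A toy usage: with a "model" that scores the mean brightness of the top-left quadrant, the returned saliency concentrates on that quadrant, mirroring how real class scores weight the fragments covering the object.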
Let Λ be a discrete lattice of size H × W and let I : Λ → R³ be a color image on it. Let f be a black-box model, which could be a CNN model, that maps the image to a scalar output value f(I) ∈ R. The latter could be an output activation, corresponding to a class prediction score in a model trained for image classification, or an intermediate activation. In the following, we investigate which parts of the input strongly activate the category, causing the response f(I) to be large. In particular, we would like to find a mask M assigning to each pixel u ∈ Λ a value M(u) ∈ [0, 1], where M(u) = 1 means that the pixel strongly contributes to the output and M(u) = 0 that it does not. Based on the Monte Carlo method, we can estimate the importance map as

S(u) ≈ (1 / (E[M] · N)) Σ_{i=1}^{N} f(I ⊙ M_i) · M_i(u),

where ⊙ denotes element-wise multiplication. Let G denote the image segmentation operation and G(I) the set of fragments of image I under this operation; each mask M_i is generated by randomly keeping or dropping the fragments of one segmentation scale. Here, N is the total number of masks across the different segmentation styles, and the function n(·) gives the number of fragments in each group.
Note that our method does not use any information from inside the model and is thus suitable for explaining any black-box model.
IV Experiments and Results
IV-A Datasets and Base Models
The experiments are performed on a single P6000 GPU. Our algorithm is implemented and evaluated in PyTorch 1.2.0. Each input image is resized to 224×224.
In the experiments, the visual comparisons between four SOTA methods and our proposed MFPP are conducted on typical samples from the MS-COCO 2014 dataset. The quantitative experiments are conducted on the whole test subset of PASCAL VOC. The pre-trained models on which our evaluation is based are VGG16 and ResNet50.
IV-B Intuitive Results
In Figure 1, we list the outputs of five methods, including our proposed MFPP. To intuitive human judgment, MFPP provides a more accurate and finer-grained importance map than the other competitive methods.
The explanation outputs of Grad-CAM are low-resolution and coarse due to the up-scaling operation in the heat-map generation process. As Figure 1 shows, Grad-CAM can locate objects like the banana in the first row or the skiers in the fifth row, but it covers too large an area and far oversteps the boundary of the object region, which decreases its reliability. For the goldfish sample in the fourth row, it shows weakness in the case of multiple tiny objects of the same category within one picture.
Some previous model-agnostic methods such as LIME, BBMP, RISE, and FGVis have accuracy and granularity issues. LIME fits a small local linear model to explain the prediction of a much more complicated model, which brings the risk of over-fitting. It simplifies the input sampling method from pixel-wise to super-pixel-wise, which leads to coarse-grained, low-accuracy results. For the first two samples, the banana and the wall clock, it roughly covers the region but only partially matches the target; it also covers some patches in wrong locations, due to the fixed probability threshold. It also handles multiple tiny objects of the same category poorly. BBMP produces poor explanations on the first three samples because it cannot correctly locate the target when the background is complex and deceptive. RISE can generate a pixel-wise heat-map, but its grid-based sampling methodology leads to two issues: first, poor performance on irregularly shaped targets, as it only suits rectangular-like objects; second, it loses granularity, as the baseball-player sample in the third row shows. The second column shows the performance of MFPP: on the first two and the last samples, it does the best job of localizing important pixels and fitting boundaries; on the goldfish sample, it finds all instances and makes no mistake; on the baseball sample, it is not affected by the complex background and finds the exact body-shape area of the target object. EP is not included in this part since its output consists only of dots.
IV-C Quantitative Results
In this section, we quantitatively evaluate the localization accuracy of explanations and the explanation speed using pointing game experiments. The pointing game extracts the maximum point of the saliency map and measures whether it falls into the ground-truth bounding boxes; the final score is defined as the ratio of correct hits, Acc = #Hits / (#Hits + #Misses). The experiments are conducted on the PASCAL VOC07 test dataset, which contains 4952 images with ground-truth labels for 20 categories. We repeat the experiments three times and take the average value. The results of the reference methods are partly taken from prior work.
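A minimal sketch of the pointing game metric just described (function names and the box convention `(x0, y0, x1, y1)` are our own):

```python
import numpy as np

def pointing_game_hit(saliency, gt_boxes):
    """One image: take the arg-max of the saliency map and check
    whether it lands inside any ground-truth box (x0, y0, x1, y1)."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in gt_boxes)

def pointing_game_accuracy(saliencies, boxes_per_image):
    """Ratio of correct hits over the dataset: #Hits / (#Hits + #Misses)."""
    hits = sum(pointing_game_hit(s, b) for s, b in zip(saliencies, boxes_per_image))
    return hits / len(saliencies)
```

Because only the single maximum point is tested per image, the metric rewards correct localization without penalizing the spatial extent of the map.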
Two versions of MFPP with different numbers of masks are included in the evaluation: Fast-MFPP uses 4k masks while MFPP uses 20k. For localization accuracy, as Table I shows, MFPP gets the highest score on ResNet50 at 89.1%, higher than the 88.9% of the current SOTA method EP. EP keeps the record on VGG16 at 88.0%. For the processing-time benchmark, as Table II shows, MFPP spends 32.63 seconds per explanation on ResNet50, which is 2.2 times faster than EP's 72.09 seconds; it is also 1.47 times faster on VGG16. Notably, the fast version Fast-MFPP is 10.7 times faster on ResNet50 and 7.3 times faster on VGG16 than EP, making it the fastest black-box explanation method in our benchmark.
Since the performance of model explanation methods can be impacted by numerous factors, we evaluate the pointing game accuracy of MFPP under different parameter settings to quantitatively discuss their influence and locate the most effective portfolio. The criteria are pointing game accuracy and average speed. As shown in Figure 6, (a) and (b) show the effect of the mask up-scaling offset on VGG16 and ResNet50; (c) shows the effect of fragmentation smoothness: the higher the sigma value on the x-axis, the smoother the boundary. If sigma is high enough, we get grids only, as the bottom-right subgraph of Figure 4 shows, and MFPP degenerates into a multi-layer RISE; (d) shows the effect of the number of fragments: extremely dense or sparse fragmentation damages the final performance. In Figure 7, (a) and (b) show the effect of the number of masks on accuracy for VGG16 and ResNet50; accuracy rises slowly with an increasing number of masks, and when MFPP adopts 20k masks, it exceeds the SOTA and gets the highest score. In Figure 8, (a) and (b) show the corresponding processing times for different mask numbers, which grow linearly. In summary, the pointing game experiments on the PASCAL VOC07 test dataset show that our proposed MFPP exceeds the accuracy of the SOTA black-box explanation method. Meanwhile, MFPP is at least twice as fast as the SOTA method in average processing time per sample; in particular, the degraded version Fast-MFPP is more than 10 times faster than the SOTA method on ResNet50 while matching its accuracy within 0.5%.
V Conclusion
This paper proposes MFPP, a novel method to explain predictions from black-box deep neural networks with a multi-scale morphological fragmental perturbation module. First, we show that morphological fragmentation is a more efficient perturbation method than previous randomized input sampling approaches for the model explanation task. Second, MFPP can generate finer-grained explanation results on irregularly shaped objects in the intuitive benchmark. Third, quantitative experiments on the entire PASCAL VOC07 test dataset show that MFPP exceeds the SOTA black-box explanation method EP on the classic pointing game accuracy score. The speed of MFPP is at least twice that of EP; in particular, the fast version of MFPP is an order of magnitude faster than EP on ResNet50 while achieving the same level of accuracy. MFPP thus outperforms the state-of-the-art method in both accuracy and speed, and we believe it can be a promising explanation method for practical deep neural network diagnosis.
-  Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels. Technical report, EPFL, 06 2010.
-  Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. WACV 2018, Mar 2018.
-  Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in NIPS, pages 6967–6976, 2017.
-  Mengnan Du, Ninghao Liu, and Xia Hu. Techniques for interpretable machine learning. arXiv preprint arXiv:1808.00033, 2018.
-  Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
-  Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks, 2019.
-  Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. ICCV 2017, Oct 2017.
-  Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):93, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR 2016, Jun 2016.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 12 2014.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network, 2013.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. Lecture Notes in Computer Science, page 740–755, 2014.
-  Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models, 2018.
-  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
-  Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. CVPR 2017, Jul 2017.
-  Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. 2018.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, Jun 2017.
-  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
-  Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Apr 2015.
-  Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization, 2016.
-  Dasom Seo, Kanghan Oh, and Il-Seok Oh. Regional multi-scale approach for visually pleasing explanations of deep neural networks. arXiv preprint arXiv:1807.11720, 2018.
-  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
-  Jorg Wagner, Jan Mathias Kohler, Tobias Gindele, Leon Hetzel, Jakob Thaddaus Wiedemer, and Sven Behnke. Interpretable and fine-grained visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on CVPR, pages 9097–9107, 2019.
-  Matthew Zeiler, Graham Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. ICCV, 2011:2018–2025, 11 2011.
-  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
-  Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
-  Quan-shi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
-  Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
-  Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. CVPR 2016, Jun 2016.