In most areas of computer vision and pattern recognition, deep convolutional neural networks (CNNs) have outperformed other methods, especially in image classification tasks krizhevsky2012imagenet ; chen2013multi ; simonyan2014very ; szegedy2015going ; he2016deep ; tissera2016deep ; guo2016deep ; yu2017convolutional . However, a conceptual understanding of why these networks work so well is still largely lacking. Since visualization is an effective way to bring insight into these matters, a number of visualization techniques have been proposed in recent years, both from the visual analytics and the deep learning communities, aiming to understand how and why deep learning works. Visual analytics researchers typically approach the problem by designing a GUI with standard graphics tools to illuminate the connections among neurons intuitively harley2015interactive ; liu2017towards . Deep learning researchers, on the other hand, tend to focus on visualizing the learned features of each neuron using different optimization-based algorithms.
The optimization-based methods generally fall into two categories: activation maximization erhan2009visualizing ; simonyan2013deep ; yosinski2015understanding and code inversion mahendran2016visualizing ; dosovitskiy2016inverting . To depict a single neuron, activation maximization generates an image that maximally activates that specific neuron; the result then serves as a representative of the candidate features learned by the neuron. Code inversion adopts a similar principle but with a different objective: at a specific layer, it aims to produce an activation vector that is close to the target vector generated by a real image, thereby revealing how the CNN encodes the input at each layer. Unfortunately, neither of these two methods returns images recognizable to a human analyst, because the solution space is so vast. This severely diminishes their usefulness. Moreover, as mentioned in wei2015understanding , a single neuron does not necessarily respond to only one feature but may respond to multiple features at the same time. To account for this problem, Nguyen et al. nguyen2016multifaceted proposed the multifaceted feature visualization (MFV) method to visualize the different facets of each neuron. Although more realistic than earlier results, the MFV method's outputs are still bizarre and hallucinogenic.
Despite the strong progress made in deep learning visualization kahng2018cti ; pezzotti2018deepeyes , one important aspect is largely overlooked: these approaches are indirect and thus cannot specifically point out which areas of the image the network is trained to focus on. Bach et al. bach2015pixel proposed the method of Layer-wise Relevance Propagation (LRP) to bridge this gap. For each input image, LRP propagates its classification probability backward through the trained network and calculates relevance scores for all pixels. By examining the intensity of different pixels in the relevance map, a direct impression of which pixels the network deems important can be obtained. While this is a good step in the right direction, LRP's relevance map fails to reveal key structural information, making the attention areas indistinguishable. This renders LRP less than ideal for showing how network models perceive images (see also Section 2 for further discussion).
To address this problem, we present a new two-step approach to understanding deep learning. At its core is a new map, referred to as the "Salient Relevance (SR) map", which directly points out the predominant attention areas of a deep network model. The basic scheme of our proposed method is outlined in Figure 1. First, starting from the trained network model, we use LRP to generate a pixel relevance map for the given image subject to recognition. Second, we use a visual saliency model to filter out irrelevant regions from this relevance map and thus reveal the true attention areas, from which the model learns features representing the recognized object. The final result is the SR map, shown in the second image column of Figure 1.
Visual saliency is a biologically inspired framework for detecting the dominant foci of human attention. Simonyan et al. simonyan2013deep introduced saliency into deep learning visualization by computing saliency maps for each class directly from derivatives. Owing to their origin in neuroscience, saliency models reflect real perception accurately and reliably. Among saliency methods, context-aware saliency is widely considered one of the best at extracting salient objects together with their surroundings, making it particularly effective at capturing the whole area of visual attention. We therefore adopt the context-aware saliency approach goferman2012context to filter the LRP output and reveal attention areas.
Recently, visual attention in deep neural networks has become one of the most active research topics, and attention mechanisms have been shown to be an essential ingredient in the success of deep neural networks cho2015describing ; vaswani2017attention ; kim2017structured in various applications. One important property of human perception is that we do not process a whole image all at once; instead, the eyes shift attention selectively to where it is needed. To some degree, neural network models recognize objects in a similar way, in that classification scores are largely determined by certain attention areas rather than the whole image. This provides strong motivation for adopting the context-aware saliency detection algorithm to unveil the attention areas of neural network models.
We applied our proposed approach to several well-known CNN structures loaded with pre-trained parameters, such as AlexNet and VGG-16, on the ILSVRC2012 validation dataset russakovsky2015imagenet . Our results demonstrate that regions belonging to the recognized objects are clearly highlighted in our SR maps. This is an interesting finding because it shows that neural networks, in some sense, mimic human vision, where attention mechanisms and visual saliency also play an important role itti1998model . Finally, our results show that our approach can clearly identify weaknesses in a given CNN model (or in the data it has been trained with).
The major contributions of our work are twofold.
First, we propose a new heatmap, the Salient Relevance (SR) map, for understanding deep learning. The SR map is generated by a two-step method that combines saliency detection with LRP. Our algorithm effectively and efficiently reveals the visual attention areas of a network model, thus reflecting the network's internal understanding of the input image behind its prediction.
Second, we utilize "attention" to understand and visualize CNN models. Although attention is widely used in the computer vision field, to the best of our knowledge, we are the first to use attention areas to help understand and interpret how a CNN model recognizes an image.
The remainder of the paper is structured as follows. Section 2 discusses the LRP method and models of visual saliency. Our proposed algorithm is explained thoroughly in Section 3. To prove the effectiveness of our technique, we designed and performed extensive experiments which are described in Section 4. Section 5 presents conclusions along with a brief discussion of future work.
2 Related Work
In this section, we briefly introduce the layer-wise relevance propagation (LRP) algorithm and review some saliency work relevant to our proposed method.
2.1 The Layer-Wise Relevance Propagation (LRP) algorithm
LRP is an inverse method which calculates the contribution of a single pixel to the prediction made by the network in the image classification task. The overall idea of pixel-wise decomposition is explained in bach2015pixel . Here we briefly reiterate some basic concepts of LRP using a simple example.
Given an input image $x$, a prediction score $f(x)$ is returned by the model, denoted as a function $f$. Suppose the network has $L$ layers, each of which is treated as a vector with dimensionality $V(l)$, where $l$ represents the index of the layer. Then, according to the conservation principle, LRP aims to find a relevance score $R_d^{(l)}$ for each vector element $d$ in layer $l$ such that the following equation holds:

$$f(x) = \sum_{d=1}^{V(L)} R_d^{(L)} = \dots = \sum_{d=1}^{V(l)} R_d^{(l)} = \dots = \sum_{d=1}^{V(1)} R_d^{(1)}$$

As the formula shows, LRP takes the prediction score $f(x)$ as the sum of relevance scores for the last layer of the network and maintains this sum throughout all layers. Figure 2 shows a simple network with six neurons, where the $w_{ij}$ are weights and the $R_i^{(l)}$ are the relevance scores to be calculated; the conservation equation above then holds layer by layer for this example as well.
Furthermore, the conservation principle also guarantees that the inflow of relevance scores to one neuron equals the outflow of relevance scores from the same neuron:

$$R_j^{(l+1)} = \sum_{i} R_{i \leftarrow j}^{(l,\,l+1)}, \qquad R_i^{(l)} = \sum_{j} R_{i \leftarrow j}^{(l,\,l+1)}$$

Here $R_{i \leftarrow j}^{(l,\,l+1)}$ is the message sent from neuron $j$ at layer $l+1$ to neuron $i$ at layer $l$, computed from the network weights according to

$$R_{i \leftarrow j}^{(l,\,l+1)} = \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)}\, R_j^{(l+1)}, \qquad z_{ij} = x_i w_{ij}, \quad z_j = \sum_i z_{ij},$$

where $\epsilon$ prevents numerical degeneration in case the denominator is close to zero.
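The propagation rule above can be illustrated with a short, self-contained sketch. This is our own numpy illustration of the $\epsilon$-rule for a single fully connected layer, not the authors' implementation; all variable names are ours.

```python
import numpy as np

def lrp_epsilon_linear(x, W, R_out, eps=1e-6):
    """Redistribute relevance R_out from a linear layer's outputs back to
    its inputs using the epsilon-rule described above.

    x     : (d_in,)        input activations of the layer
    W     : (d_in, d_out)  weight matrix
    R_out : (d_out,)       relevance of each output neuron
    """
    z = x[:, None] * W                  # z_ij: contribution of input i to output j
    z_sum = z.sum(axis=0)               # z_j: denominator for each output neuron
    # epsilon stabiliser guards against a near-zero denominator
    # (robust implementations treat sign(0) as +1)
    denom = z_sum + eps * np.sign(z_sum)
    # message sent from output neuron j back to input neuron i
    messages = z * (R_out / denom)[None, :]
    return messages.sum(axis=1)         # relevance of each input neuron

# conservation check: total relevance is (approximately) preserved
x = np.array([1.0, 2.0, 0.5])
W = np.array([[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]])
R_out = np.maximum(x @ W, 0.0)          # e.g. post-ReLU outputs as initial relevance
R_in = lrp_epsilon_linear(x, W, R_out)
```

Applying the rule layer by layer from the output back to the input yields the pixel-wise relevance map.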
2.2 The Saliency Map
Visual saliency is a biologically inspired model of measuring which information stands out relative to its neighbors and so attracts human attention carrasco2011visual . It was originally promoted by psychologists in the study of attention in infancy colombo2001development . Itti et al. itti1998model presented a computational architecture to introduce the basic Koch and Ullman model koch1987shifts to the field of computer vision. There is an extensive body of literature on various applications of visual saliency, such as zhu2014ensemble ; oh2016detection ; terzic2017texture ; mukherjee2017saliency to name just a few. Recently, researchers have also integrated saliency with the latest deep learning techniques in salient object detection and category-specific object detection han2018advanced . As is summarized in harel2007graph , most visual saliency models are implemented in three stages:
extraction: extracting low-level features over the image
activation: generating activation maps from the features
normalization/combination: normalizing the activation maps and combining them into a single saliency map
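As a toy illustration of these three stages (our own minimal numpy sketch, not one of the cited saliency models; the scale values are illustrative), one might write:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur implemented with plain numpy convolutions."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode='edge')
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, 'valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, 'valid'), 0, rows)

def simple_saliency(gray):
    # 1) extraction: low-level feature maps (here, intensity at several scales)
    scales = [1.0, 2.0, 4.0]
    feats = [gaussian_blur(gray, s) for s in scales]
    # 2) activation: center-surround differences highlight local contrast
    acts = [np.abs(feats[i] - feats[i + 1]) for i in range(len(feats) - 1)]
    # 3) normalization/combination: rescale each map to [0, 1] and average
    normed = [(a - a.min()) / (a.max() - a.min() + 1e-8) for a in acts]
    return np.mean(normed, axis=0)
```

A bright patch on a flat background produces high values near the patch and low values elsewhere, mirroring the three-stage pipeline described above.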
Saliency detection algorithms generally place an emphasis on identifying the fixation points of a human viewer and on detecting a single dominant object. However, apart from raw saliency, the context in which the dominant object is located is equally essential in image understanding, and this is why we choose context-aware saliency detection goferman2012context as our saliency model. The context-aware saliency detection algorithm extracts salient objects in the image together with their meaningful surroundings. As such, it obeys all four psychological rules of human visual attention koffka2013principles , making it an ideal paradigm for revealing true visual attention:
1: Consider local low-level features such as contrast and color
2: Suppress frequently-occurring features
3: Organize one or several centers of visual gravity
4: Maintain high-level factors
Saliency methods have garnered attention among deep learning researchers sundararajan2017axiomatic ; montavon2017explaining ; li2017cnn ; kindermans2017patternnet because they can address all of the desiderata mentioned above. Most of this existing work calculates heatmaps from the predictions of the network, albeit using different equations.
Most recently, Kindermans et al. pointed out that methods such as LRP may suffer from the problem of input variance kindermans2017reliability . We conducted our studies independently at around the same time as these authors but pursued a different and presumably more powerful variant of this approach. In our work, we go beyond LRP by adding saliency detection directly into the deep CNN understanding and interpretation scheme. The results obtained with our framework show that this approach is indeed highly effective.
3.1 Algorithm for Generating the Salient Relevance (SR) Map
We shall now describe the algorithm that generates our proposed Salient Relevance (SR) map. Figure 1 provides an illustration of our workflow, using an example image of a vase with flowers. Algorithm 1 lists the pseudo code.
In the first step we generate the standard relevance map using the layer-wise relevance propagation (LRP) algorithm. The input is the classification vector calculated by the network model from the input image (the bottom left image in Figure 1). To precisely infer the model’s perception of the input image, only the class of the highest probability value is retained while other values are set to zero. This probability is then propagated backward through the network, layer-by-layer. At each layer, the perception information is transferred from the network’s output feature maps to the input feature maps using the existing parameters. The propagation continues until the first layer of the model has been reached. The relevance map generated in this way is shown in the top left of Figure 1. It has the same size as the original image.
The second step refines the relevance map into our salient relevance (SR) map. We achieve this using context-aware saliency detection, explained next. In context-aware saliency detection, a single pixel is considered salient if the patch centered at this pixel is distinct from other image patches, and is so at multiple scales. It is also important to note that background patches tend to remain similar across scales, while foreground objects are likely to be salient only at a few scales; this helps distinguish background from foreground regions. In order to include background regions surrounding the foci of attention, each pixel outside the attended areas is weighted by its Euclidean distance to the closest attended pixel. In this way, interesting background areas of salient objects are incorporated into the saliency map, while non-interesting regions are excluded. Via this procedure, our method then uses the information gathered at multiple scales to increase the contrast between salient and non-salient pixels. The salient relevance map for our vase-with-flowers example is shown in the center column, top image, of Figure 1.
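The patch-based distinctness idea can be sketched as follows. This is our own single-scale, grayscale simplification of the measure in goferman2012context ; the parameters patch, K, and c are illustrative choices, and the published method additionally works in CIE Lab color, over multiple scales, and with the immediate-context weighting described above.

```python
import numpy as np

def patch_distinctness(img, patch=3, K=16, c=3.0):
    """Single-scale sketch of context-aware distinctness: a pixel is salient
    if its patch differs from even its K most similar patches elsewhere in
    the image; dissimilarity is damped by spatial distance, so repeated
    background texture stays non-salient."""
    h, w = img.shape
    r = patch // 2
    pad = np.pad(img, r, mode='edge')
    # collect every patch as a flat vector together with its position
    patches, pos = [], []
    for i in range(h):
        for j in range(w):
            patches.append(pad[i:i + patch, j:j + patch].ravel())
            pos.append((i / h, j / w))          # normalised coordinates
    patches, pos = np.array(patches), np.array(pos)
    sal = np.zeros(h * w)
    for idx in range(h * w):
        d_col = np.linalg.norm(patches - patches[idx], axis=1)
        d_pos = np.linalg.norm(pos - pos[idx], axis=1)
        d = d_col / (1.0 + c * d_pos)
        # the K most similar patches dominate the distinctness score
        nearest = np.sort(d)[1:K + 1]           # skip the patch itself
        sal[idx] = 1.0 - np.exp(-nearest.mean())
    return sal.reshape(h, w)
```

The brute-force pairwise comparison is quadratic in the number of pixels, so this sketch is only practical for small images; the published implementation restricts comparisons to nearest neighbors for efficiency.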
Finally, we integrate the salient relevance (SR) map with a Canny edge map obtained from the input image to afford a more contextual visualization of the SR map. Figure 1, center column, bottom image shows the Canny edge map, while Figure 1, right image, shows the Canny edge map fused with the SR map.
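A minimal sketch of this second step and the final fusion might look as follows. This is our own numpy illustration, with a Sobel-style gradient magnitude standing in for the Canny detector; the thresholds are illustrative, not taken from the paper.

```python
import numpy as np

def salient_relevance(relevance, saliency, thresh=0.1):
    """Step 2 of the pipeline: mask the LRP relevance map with the saliency
    map so that only pixels inside the attention area survive."""
    sal = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return relevance * (sal > thresh)

def fuse_with_edges(sr_map, gray, edge_thresh=0.2):
    """Overlay an edge map on the SR map to show the scene's boundaries.
    A central-difference gradient magnitude stands in here for Canny."""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    edges = np.hypot(gx, gy) > edge_thresh
    return np.maximum(sr_map / (sr_map.max() + 1e-8), edges.astype(float))
```

Relevance pixels outside the salient region are zeroed out, and the fused view keeps both the surviving relevance and the scene's edge structure.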
3.2 Comparative Experiments
In the following we use two running examples to illustrate our proposed method, and also compare it with the conventional one. The examples will show that our method better unveils the model’s real perception. The first study uses the pre-trained AlexNet krizhevsky2012imagenet as the network model and images from ILSVRC2012 validation dataset as the input. Results are shown in Figure 3.
This image is labeled "grand piano" as ground truth, and AlexNet classifies it correctly. Using LRP, we can propagate the prediction probability backward through the network to see which pixels contribute to the classification result. It is fair to assume that those pixels should fall within the area of the piano in the image, since the network recognizes this object. However, as we can see in 2(c), this is not the case. Although pixels representing the piano are present in the LRP relevance map, many irrelevant pixels are also included, such as those at the top of the image. In fact, those unrelated pixels are so prominent that it is quite difficult to distinguish the piano in the relevance map.
To determine the exact attention area, we apply the saliency detection procedure to the relevance map. This generates the SR map, in which we observe that all of the dominant pixels in fact belong to the detected object. The SR map clearly shows the attention area of the network model, namely the piano in the front. Next, we fuse both the relevance map and the SR map with the Canny edge map canny1987computational of the original image, with the aim of better visualizing the maps in the context of the scene's boundaries. Comparing 2(e) and 2(f), it is obvious that our method outperforms LRP at precisely revealing the attention area of the network.
In order to show that our method successfully captures the network’s perception focus of an image, we compare the saliency map of the original image with our SR map. The former represents what human eyes would recognize while the latter shows what the network detects. Figure 4 shows the difference.
As we can see in Figure 3(b), the dominant objects in the attention area include the horse, the man, and the horse trailer further back. Since among these three objects only the horse is included in the ImageNet training dataset, it comes as no surprise that our pre-trained AlexNet model classifies the image as "sorrel", a breed of horse named after its coat color. As our SR map (derived from the relevance map) shows, only pixels belonging to the horse are prominent; neither the man nor the horse trailer is present. One thing to note is that the man's hair is also highlighted in our SR map, because it shares a similar color and texture with the horse's coat. These results vividly demonstrate that our SR map is capable of effectively showing the network's real understanding.
For quantitative comparison, we choose the structural similarity index (SSIM) wang2004image as a metric. We apply the context-aware saliency detection algorithm directly to the original input images and use these saliency maps as the gold standard, because saliency algorithms convey the true visual perception of human observers. We then calculate the SSIM values of the LRP relevance map and of our SR map against the reference saliency map. Higher SSIM values indicate better agreement with human perception and are thus preferable. We use the pre-trained AlexNet as the model and randomly select images from the ILSVRC2012 validation dataset as input. Based on our experimental results, we find that the SSIM value varies significantly across input images; two examples are shown in Figure 5. This is because when the input image has a complex background, the original saliency map contains both foreground and background features, whereas the network is trained to capture only the main object. As a result, there is a gap between the saliency map and our proposed SR map, which leads to a low SSIM value. Therefore, instead of comparing the absolute SSIM values directly, we compare the ratio of the two SSIM values. As shown in Table 1, our SR map consistently outscores the LRP relevance map.
Table 1: Average SSIM of the LRP relevance map and of our SR map against the reference saliency map, and their ratio.

| | SSIM(LRP, saliency map) | SSIM(SR, saliency map) | ratio |
| average ratio over images | N/A | N/A | 1.7038 |
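The ratio reported above can be computed along these lines. For illustration we use a simplified single-window SSIM in numpy; the actual SSIM index wang2004image averages the same statistic over local windows rather than over the whole image.

```python
import numpy as np

def global_ssim(a, b, C1=1e-4, C2=9e-4):
    """Single-window SSIM over the whole image (the full index in
    wang2004image computes this statistic per local window and averages)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2) /
            ((mu_a**2 + mu_b**2 + C1) * (va + vb + C2)))

def ssim_ratio(lrp_map, sr_map, ref_saliency):
    """Ratio used in Table 1: values above 1 mean the SR map matches the
    reference saliency map more closely than the raw LRP relevance map."""
    return global_ssim(sr_map, ref_saliency) / global_ssim(lrp_map, ref_saliency)
```

Because absolute SSIM values vary strongly with image complexity, only the ratio between the two maps is compared across images.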
These examples support the claim that the CNN model treats visual input in a fashion quite similar to the human visual system, which has been well studied. In that sense, our SR map, derived from the LRP-generated relevance map, provides a link between these two systems: the CNN and the real neural network that is part of the human visual system.
4 Case Studies and Discussions
In this section, we further explore potential applications of the SR map via several case studies. All of our experiments were conducted on Ubuntu 16.04 LTS using an NVIDIA GTX 1080 graphics card. We implemented the layer-wise relevance propagation (LRP) algorithm (tutorial: http://www.heatmapping.org/tutorial/) using the PyTorch package (https://github.com/pytorch/pytorch). For context-aware saliency detection we used the Matlab code offered by the original authors (http://webee.technion.ac.il/cgm/Computer-Graphics-Multimedia/Software/Saliency/Saliency.html). Our neural networks are the pre-trained models included in the PyTorch package, whose benchmark performances on the ILSVRC 2012 validation dataset are also publicly available (https://github.com/jcjohnson/cnn-benchmarks). Our code is written in Jupyter Notebook and can be obtained from GitHub (https://github.com/Hey1Li/Salient-Relevance-Propagation), which includes the settings for all our experiments. Since we use pre-trained models, we do not need to train the neural networks from scratch. The inference time is negligible, as it is comparable to one backpropagation step without parameter updates. The saliency detection step, on the other hand, takes up most of the testing time; the running time of the context-aware saliency detection algorithm on one image varies considerably depending on image complexity.
4.1 AlexNet vs. VGG-16
In the first experiment, we analyze the performance gap between AlexNet krizhevsky2012imagenet and VGG-16 simonyan2014very . According to the benchmark, VGG-16 achieves a substantially lower top-1 error rate than AlexNet. Therefore, to explain why VGG-16 outperforms AlexNet on ImageNet, we apply our algorithm to images mislabeled by AlexNet but correctly classified by VGG-16 and examine their SR maps. The example shown in Figure 6 is reviewed in detail here; more images are included in Figure 7 and Figure 8.
The input image is labeled "coach" as its ground truth. VGG-16 recognizes it correctly, but AlexNet misclassifies it as "cinema". At first glance, it is difficult to understand how AlexNet can make such a mistake, because a cinema is hugely different from a transit bus. But looking at the SR map of AlexNet, it is obvious that some prominent pixels belong to the bus while others belong to the buildings in the back. It is therefore reasonable to infer that AlexNet actually identifies the coach as part of the whole structure: treating the red bus as the marquee above the entrance, AlexNet classifies the image as a cinema with high confidence. In contrast, VGG-16 successfully separates the bus from the building, as all dominant pixels in its SR map fall within the area of the coach. As we can see in Figure 7 and Figure 8, this is not an isolated incident; VGG-16 consistently outperforms AlexNet at separating different objects, which is why VGG-16 beats AlexNet in this test.
4.2 Evaluation of VGG-16
We explained how VGG-16 makes correct decisions in the previous experiment, but VGG-16 still mislabels about one fourth of all validation images. In this study, we therefore utilize the SR map to analyze VGG-16's mistakes. The input image in Figure 9 depicts a vivid scene in which a gray wolf is running away from an ox. The wolf is located in the center and, as the saliency map indicates, is the most attractive to human attention. However, VGG-16 turns its attention to the black ox on the left side, since only that area is highlighted in the SR map. This coincides with the fact that VGG-16 labels the image as "ox" while the ground truth is "gray wolf". So unlike AlexNet, which fails to differentiate multiple objects, VGG-16 is only capable of recognizing a single object in the input image and ignores the proper context information, which is why it is wrong in this case. Since ground-truth labels are based on human consensus, we argue that VGG-16 fails to grasp the true meaning of input images despite its capability of recognizing objects. More examples are included in Figure 10 and Figure 11.
4.3 VGG-16 vs. VGG-Face
The goal of our final experiment is to use the SR map to demonstrate the versatility of neural network models. VGG-16 and VGG-Face parkhi2015deep share the same 16-layer network structure. The difference is that VGG-Face is trained on a large dataset of facial images, while VGG-16 is trained on ImageNet with no exposure to human faces. Thus, given the same input image of a girl holding a puppy, these two network models are expected to attend to different areas and highlight different regions. As shown in Figure 12, our method visibly captures this distinction: VGG-Face emphasizes only the girl's face in its SR map, whereas VGG-16 focuses on the dog's face and labels the image "Labrador Retriever", one of the most popular dog breeds in North America. Figure 13 and Figure 14 include more examples in which each input image contains one human face and another prominent object.
To summarize, we conducted three experiments to demonstrate our method's capability of visualizing network models' real comprehension. In the first experiment, we compared AlexNet with VGG-16 and explained their performance gap on the ImageNet classification task. We then further analyzed VGG-16 by examining its mistakes. In the last study, we showed that the same network structure can attend to different objects when trained on different datasets. Together, these results demonstrate the effectiveness of our algorithm.
5 Conclusions and Future Work
In this paper, we have proposed a novel two-step visualization algorithm to generate the SR map (source code available at https://github.com/Hey1Li/Salient-Relevance-Propagation), which aims to understand deep CNN models and reveal the areas from which the models learn representative features; we refer to these areas as attention areas. By combining layer-wise relevance propagation with context-aware saliency detection, our proposed method successfully reveals a CNN model's visual attention and thus its true perception of input images. Experimental results using several well-known models on the ILSVRC2012 validation dataset show that the SR map not only reveals a neural network's perception but is also a superior tool for helping researchers understand deep learning models.
In the future, we plan to apply our method to analyze performances of more complex neural network models such as ResNet. Further, we will build direct connections between our visual analysis and proper training adjustments. As a consequence, our visualization tool can be directly applied to improve performances of deep network models.
This work is supported by Midea Corporate Research Center University Program. We would also like to show our gratitude to Prof. Tengyu Ma for his comments on an earlier version of the manuscript.
This research was also partially supported by NSF grant IIS 1527200, as well as the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the “ITCCP Program” directed by NIPA.
- (1) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
- (2) Z. Chen, Z. Chi, H. Fu, D. Feng, Multi-instance multi-label image classification: A neural approach, Neurocomputing 99 (2013) 298–306.
- (3) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. International Conference on Learning Representations, 2015.
- (4) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- (5) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- (6) M. D. Tissera, M. D. McDonnell, Deep extreme learning machines: supervised autoencoding architecture for classification, Neurocomputing 174 (2016) 42–49.
- (7) Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M. S. Lew, Deep learning for visual understanding: A review, Neurocomputing 187 (2016) 27–48.
- (8) S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
- (9) A. W. Harley, An interactive node-link visualization of convolutional neural networks, in: International Symposium on Visual Computing, Springer, 2015, pp. 867–877.
- (10) M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, S. Liu, Towards better analysis of deep convolutional neural networks, IEEE transactions on visualization and computer graphics 23 (1) (2017) 91–100.
- (11) D. Erhan, Y. Bengio, A. Courville, P. Vincent, Visualizing higher-layer features of a deep network, University of Montreal 1341 (2009) 3.
- (12) K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, in: ICLR Workshop, 2014.
- (13) J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, H. Lipson, Understanding neural networks through deep visualization, in: ICML Workshop on Deep Learning, 2015.
- (14) A. Mahendran, A. Vedaldi, Visualizing deep convolutional neural networks using natural pre-images, International Journal of Computer Vision 120 (3) (2016) 233–255.
- (15) A. Dosovitskiy, T. Brox, Inverting visual representations with convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4829–4837.
- (16) D. Bau, B. Zhou, A. Khosla, A. Oliva, A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- (17) A. Nguyen, J. Yosinski, J. Clune, Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks, in: ICML Workshop on Visualization for Deep Learning, 2016.
- (18) M. Kahng, P. Y. Andrews, A. Kalro, D. H. P. Chau, Activis: Visual exploration of industry-scale deep neural network models, IEEE transactions on visualization and computer graphics 24 (1) (2018) 88–97.
- (19) N. Pezzotti, T. Höllt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, A. Vilanova, Deepeyes: Progressive visual analytics for designing deep neural networks, IEEE transactions on visualization and computer graphics 24 (1) (2018) 98–108.
- (20) S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (7) (2015) e0130140.
- (21) S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (10) (2012) 1915–1926.
- (22) K. Cho, A. Courville, Y. Bengio, Describing multimedia content using attention-based encoder-decoder networks, IEEE Transactions on Multimedia 17 (11) (2015) 1875–1886.
- (23) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
- (24) Y. Kim, C. Denton, L. Hoang, A. M. Rush, Structured attention networks, arXiv preprint arXiv:1702.00887.
- (25) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
- (26) L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on pattern analysis and machine intelligence 20 (11) (1998) 1254–1259.
- (27) M. Carrasco, Visual attention: The past 25 years, Vision research 51 (13) (2011) 1484–1525.
- (28) J. Colombo, The development of visual attention in infancy, Annual review of psychology 52 (1) (2001) 337–367.
- (29) C. Koch, S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry, in: Matters of intelligence, Springer, 1987, pp. 115–141.
- (30) Z. Zhu, Q. Chen, Y. Zhao, Ensemble dictionary learning for saliency detection, Image and Vision Computing 32 (3) (2014) 180–188.
- (31) K. Oh, M. Lee, G. Kim, S. Kim, Detection of multiple salient objects through the integration of estimated foreground clues, Image and Vision Computing 54 (2016) 31–44.
- (32) K. Terzić, S. Krishna, J. du Buf, Texture features for object salience, Image and Vision Computing 67 (2017) 43–51.
- (33) P. Mukherjee, B. Lall, Saliency and kaze features assisted object segmentation, Image and Vision Computing 61 (2017) 82–97.
- (34) J. Han, D. Zhang, G. Cheng, N. Liu, D. Xu, Advanced deep-learning techniques for salient and category-specific object detection: a survey, IEEE Signal Processing Magazine 35 (1) (2018) 84–100.
- (35) J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: Advances in neural information processing systems, 2007, pp. 545–552.
- (36) K. Koffka, Principles of Gestalt psychology, Vol. 44, Routledge, 2013.
- (37) A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating activation differences, in: Proceedings of the 34th International Conference on Machine Learning, ICML, 2017, pp. 3145–3153.
- (38) G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller, Explaining nonlinear classification decisions with deep taylor decomposition, Pattern Recognition 65 (2017) 211–222.
- (39) H. Li, J. Chen, H. Lu, Z. Chi, Cnn for saliency detection with low-level feature integration, Neurocomputing 226 (2017) 212–220.
- (40) P.-J. Kindermans, K. T. Schütt, M. Alber, D. Erhan, B. Kim, S. Dähne, Learning how to explain neural networks: Patternnet and patternattribution, arXiv preprint arXiv:1705.05598v2.
- (41) P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, B. Kim, The (un) reliability of saliency methods, in: NIPS Workshop 2017 on Explaining and Visualizing Deep Learning, 2017.
- (42) J. Canny, A computational approach to edge detection, in: Readings in Computer Vision, Elsevier, 1987, pp. 184–203.
- (43) Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (4) (2004) 600–612.
- (44) O. M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition., in: BMVC, Vol. 1, 2015, p. 6.