Implicit Saliency in Deep Neural Networks

08/04/2020 ∙ by Yutong Sun, et al. ∙ Georgia Institute of Technology

In this paper, we show that existing recognition and localization deep architectures, which have not been exposed to eye-tracking data or any saliency datasets, are capable of predicting human visual saliency. We term this implicit saliency in deep neural networks. We calculate this implicit saliency using the expectancy-mismatch hypothesis in an unsupervised fashion. Our experiments show that extracting saliency in this fashion provides comparable performance when measured against state-of-the-art supervised algorithms. Additionally, our method is more robust than those algorithms when large noise is added to the input images. Also, we show that semantic features contribute more than low-level features to human visual saliency detection.


1 Introduction

Figure 1: Implicit Saliency Generation Process: We give an input image to a pretrained network and get an output vector based on the prior knowledge of the network. We provide an unexpected stimulus, which is a vector that conflicts with the output vector. We use a loss function to encode the error based on the output vector and the conflicting vector. Backpropagating this error to a semantic convolutional layer results in pseudo-saliency maps. We combine the resultant pseudo-saliency maps using statistical methods to obtain the final saliency map.

Saliency is defined as those regions in a visual scene that are ‘most noticeable’ and attract significant attention [30]. Human visual saliency detection has been deployed in an extensive set of image processing applications including but not limited to data compression, image segmentation, image quality assessment (IQA), and object recognition [1]. Broadly, saliency detection algorithms can be classified into two categories. The first is bottom-up approaches, where saliency detection techniques extract features from data and compute saliency based on the extracted features [16, 6, 5]. The second is top-down approaches, where the algorithms have a prior target for which features are to be calculated [20]. Both these approaches derive from the expectancy-mismatch hypothesis [2].

The expectancy-mismatch hypothesis for a sensory system is based on receiving information that is in conflict with the system’s prior expectation. The authors in [2] show that a message that is unexpected captures human attention and is hence salient. Extensive work in the field of cognitive sciences has studied the impact of expectancy-mismatch on human attention and visual saliency [27, 15, 12, 13, 2]. Based on these works, the human attention mechanism suppresses expected messages and focuses on unexpected ones. During this process, the human visual system checks whether the input scenario matches the observer’s expectation and past experience. When they conflict, error neurons in the human brain encode the prediction error and pass the error message back to the representational neurons. Existing work applies this concept of expectancy-mismatch to saliency detection. The authors in [12, 13] show how unexpected colors impact human eye fixations. [2] indicates that a motion singleton captures attention.

Previous works that define expectations and calculate mismatches are based on low-level representations like colors and edges. However, the advent of deep learning has shown the importance of semantic information that combines low-level features for complicated tasks like recognition. Neural networks have shown an aptitude for learning higher-order semantic representations. In [28], the authors claim that it is crucial to consider semantic representations in saliency detection. In this paper, we propose to create expectancy based on high-level semantic features and calculate the mismatch with the input information to obtain saliency. To set expectancy, we use neural networks. To calculate the mismatch, we provide conflicting information to the network along with the input image and search for those regions in the input image that are affected by the conflict. In this work, conflicting information refers to labels that conflict with the predicted classes. For instance, consider Fig. 1. The network has learned low-level features like edges and colors, and their combinatorial high-level semantics, to recognize a car. However, by providing a conflicting label such as ‘airplane’, we force the network to re-examine its decision process. The network reconciles its expectation of finding a car with the conflicting label that it is an airplane by encoding the error within the gradients. These gradients are backpropagated throughout the network to resolve the conflict. The change brought about by the gradients is indicative of the regions within the image that are used for expecting the output. We postulate that these regions are thereby salient.

In this work, we use commonly used pre-trained recognition and localization networks to set expectancy. These networks have not been exposed to either saliency datasets or eye-tracking data; hence, the proposed method is completely unsupervised. We extract saliency that is implicitly embedded within any given network, and we therefore term the proposed approach implicit saliency in neural networks. The contributions in this paper are three-fold:

  1. We extract implicit saliency, in an unsupervised fashion, from pre-trained networks that have never seen eye-tracking data.

  2. We show that the proposed implicit saliency is robust to noise.

  3. We show that semantic features combined with unexpected stimuli have a higher correlation with human visual saliency than low-level features or semantic features without unexpected stimuli.

We introduce the background for the pre-trained deep neural networks in Section 2. In Section 3, we detail the proposed method to extract implicit saliency. In Section 4, we compare the performance of the proposed method against state-of-the-art supervised methods and model saliency methods. We conclude in Section 5.

2 Background

Visual recognition is a common activity on which humans heavily rely to interact with their environments both accurately and rapidly. It has been shown that the process of recognition and categorization in humans takes place within a fraction of a second [8, 29]. Furthermore, this process is performed in the cortex area of the brain, which is controlled by the human attention mechanism; hence recognition relates strongly to human attention selection [10]. Therefore, in this paper recognition networks are utilized as the backbone of the proposed implicit saliency, as shown in Fig. 1. Specifically, we utilize ResNet-18, -34, -50, and -101 [11] and VGG [25] that are pre-trained on ImageNet [7], as well as Faster R-CNN [23] that is pre-trained on PASCAL VOC 2007+2012 [9]. We implement this project in PyTorch.
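
For reference, the sketch below shows how such pre-trained backbones can be loaded in PyTorch. It is only a minimal illustration: the torchvision model names, the choice of VGG-16, and the COCO-trained Faster R-CNN weights are our assumptions, since the paper's detector is trained on PASCAL VOC 2007+2012.

```python
# Minimal sketch: loading pre-trained backbones with torchvision.
# No saliency or eye-tracking data is involved at any point.
import torchvision.models as models
from torchvision.models.detection import fasterrcnn_resnet50_fpn

backbones = {
    "resnet18": models.resnet18(pretrained=True),
    "resnet34": models.resnet34(pretrained=True),
    "resnet50": models.resnet50(pretrained=True),
    "resnet101": models.resnet101(pretrained=True),
    "vgg16": models.vgg16(pretrained=True),  # assumed VGG variant
}
# COCO-trained stand-in; the paper's detector is trained on PASCAL VOC 2007+2012.
detector = fasterrcnn_resnet50_fpn(pretrained=True)

for net in list(backbones.values()) + [detector]:
    net.eval()  # inference mode only; the weights are never fine-tuned
```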

We denote an $L$-layered network as $f(\cdot)$, its weights as $W$, and its biases as $b$. During the training process, $W$ and $b$ are updated until the model is parameterized by these weights and biases. A pre-trained network provides the expectation based on the prior knowledge obtained during the feed-forward process; we call such features feed-forward features. We also denote the provided conflicting information mentioned in Sec. 1 as $\hat{y}$, and we call the gradients corresponding to the encoded error conflicting features.

3 Implicit Saliency Generation

Figure 2: Conflicting feature generation
Figure 3: Saliency map visualization. (a) Input image (b) Ground truth (c) Proposed Method (d) Feed-forward feature (e) SalGan [21] (f) ML-Net [5] (g) DeepGazeII [17] (h) ShallowDeep [22]
Figure 4: Visualization of NSS and CC gains (in red) of using the proposed conflicting features over feed-forward features on MIT1003

Let the pre-trained network have $N$ classes. Therefore, the prediction $y$ and the unexpected stimulus $\hat{y}$ are $N$-dimensional vectors. Each class $n$ has a corresponding one-hot unexpected stimulus vector $\hat{y}_n$. As shown in Fig. 2, having $y$ and $\hat{y}_n$ in hand, we encode the unexpected information by a convex loss function $J$. The encoded unexpected information is denoted as $J(W, x, \hat{y}_n)$, where $W$ are the weights, $x$ is the input, and $\hat{y}_n$ is the assigned class in the unexpected stimulus. Note that this formulation of $J$ is similar to the one described in [18]. In this paper, we use the cross-entropy function for $J$. The generation process of the proposed implicit saliency map is shown in Alg. 1, where $\partial J / \partial F_l$ represents the gradients used to resolve the conflict on a specific semantic layer $l$. Since there are $N$ classes in this pre-trained network, we get $N$ pseudo-saliency maps. Each pseudo-saliency map shows the salient region corresponding to the class in the given unexpected stimulus. Notice that we also take the network’s decision class as an unexpected stimulus because, although the decision comes from the highest score in the output, the network is still not fully certain of its decision, thereby providing a non-zero loss $J$.
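
Before the full algorithm, the PyTorch sketch below illustrates the conflicting-feature generation of Fig. 2 for a single unexpected class. The choice of ResNet-18, the hooked layer, and all variable names are our assumptions for illustration, not the authors' released code.

```python
# Sketch of conflicting-feature generation (Fig. 2) for one unexpected class.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224)  # stand-in for a real input image

# Capture the feature maps F_l of a semantic convolution block with a forward hook.
feats = {}
hook = model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o))

logits = model(image)  # feed-forward expectation

# Unexpected stimulus: any class other than the prediction. Cross-entropy with a
# class index is equivalent to using the corresponding one-hot vector.
conflict_class = (logits.argmax(dim=1) + 1) % logits.shape[1]
loss = F.cross_entropy(logits, conflict_class)

# Backpropagate the encoded error and read off dJ/dF_l: one pseudo-saliency map.
pseudo_saliency = torch.autograd.grad(loss, feats["maps"])[0].squeeze(0)  # (C, H, W)
hook.remove()
```

Repeating the conflict, loss, and gradient steps for every class yields the stack of $N$ pseudo-saliency maps that Alg. 1 combines.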

  1. Generate pseudo-saliency maps on a specific convolution layer as $S_{n,k}(i, j) = \partial J(W, x, \hat{y}_n) / \partial F_{l,k}(i, j)$, where
     $F_{l,k}$: feature maps corresponding to the $k$-th convolution filter
     $k$: index for each convolution filter
     $l$: specific convolution layer
     $(i, j)$: spatial indices
  2. Average the pseudo-saliency maps over the class dimension: $\mu = \frac{1}{N}\sum_{n=1}^{N} S_n$
  3. Generate the variance mask over the class dimension: $\sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (S_n - \mu)^2$
  4. Generate the implicit saliency map: $S = \mu \odot (1 - \hat{\sigma}^2)$, where
     $\odot$: element-wise multiplication
     $\hat{\sigma}^2$: normalized variance map
  Algorithm 1: Implicit Saliency Map Generation

As shown in Fig. 1, for different classes, pseudo-saliency maps focus on different salient regions. The final implicit saliency map is the combination of all these salient regions. Specifically, we use the mean to combine all the pseudo-saliency maps. To decrease the uncertainty in the overall saliency map, we negate the variance of all the pseudo-saliency maps, which is shown in Step 4 of Alg. 1. Qualitatively, we can see that the right part of the implicit saliency map is primarily influenced by the pseudo-saliency maps from airplane-like classes while the left part is derived from the car-like classes. Note that each pseudo-saliency map is a matrix with the same dimensionalities as the convolution filters in the $l$-th layer. In Fig. 1, we sum it up over the depth dimension and rescale it for visualization.
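
One possible reading of this combination is sketched below: the class-wise mean is kept, and the normalized class-wise variance is used to suppress uncertain regions. The absolute value, the summation over filters before combining, and the min-max normalization are our assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch of the statistical combination in Alg. 1 (assumptions noted above).
import torch

def combine_pseudo_saliency(pseudo_maps: torch.Tensor) -> torch.Tensor:
    """pseudo_maps: (N_classes, C, H, W) gradients from the per-class conflicts."""
    per_class = pseudo_maps.abs().sum(dim=1)  # collapse the filter depth -> (N, H, W)

    mean_map = per_class.mean(dim=0)  # average over the class dimension
    var_map = per_class.var(dim=0)    # variance over the class dimension
    var_map = (var_map - var_map.min()) / (var_map.max() - var_map.min() + 1e-8)

    saliency = mean_map * (1.0 - var_map)  # negate (down-weight) the variance
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```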

4 Experiments

In Section 1, we motivate implicit saliency using expectancy-mismatch. We generate saliency maps by highlighting regions where the network is re-examining its decision process because of the provided conflict. In this section, we sequentially validate these arguments. First, we demonstrate that expectancy-mismatch in neural networks has a higher correlation with saliency as compared to feed-forward expectancy features. As such, we compare the proposed implicit saliency against feed-forward feature maps. Next, we show that the regions that are in conflict with the decision process are more salient than the regions used to make the decision. Finally, we compare the performance and robustness of the proposed unsupervised implicit saliency against state-of-the-art supervised saliency detection methods.

4.1 Implicit Saliency of Pre-trained Networks

The expectancy of an input image is encoded in the feed-forward activation maps. In this experiment, we compare the saliency maps obtained by feed-forward expectancy features and the proposed expectancy-mismatch features. For the feed-forward method, we apply the same statistical process as in Alg. 1 but use activations instead of gradients to obtain the saliency map. The saliency map based on conflicting features is generated as described in Sec. 3. We validate the saliency detection capabilities of feed-forward and conflicting features on the MIT1003 [3] dataset, which consists of 1003 images and the corresponding eye-tracking data from human subjects. We qualitatively evaluate the performance of saliency detection in Fig. 3. The feed-forward features in Fig. 3(d) focus on edges and textures without specific localization. The proposed implicit saliency generates localized saliency maps that are highly correlated with the ground truth. These visualizations show that visual saliency is more effectively captured through the expectancy-mismatch process than through the expectancy process alone.
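
As a rough stand-in for this baseline, the snippet below grabs the same layer's activations (no conflicting label, no backpropagation) and collapses them over the filter dimension, reusing `model` and `image` from the earlier sketch. How exactly the Alg. 1 statistics are applied to activations is not specified here, so this is only illustrative.

```python
# Illustrative stand-in for the feed-forward baseline: activations only.
import torch

with torch.no_grad():
    feats = {}
    hook = model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o))
    _ = model(image)
    hook.remove()

acts = feats["maps"].squeeze(0)      # (C, H, W) feed-forward features
ff_saliency = acts.abs().sum(dim=0)  # collapse over filters
ff_saliency = (ff_saliency - ff_saliency.min()) / (ff_saliency.max() - ff_saliency.min() + 1e-8)
```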

Table 1: Human visual saliency vs. model saliency. NSS and CC of Grad-CAM, GBP, and the proposed Implicit Saliency for ResNet-18, ResNet-34, ResNet-50, and ResNet-101.

Table 2: Robustness analysis of implicit saliency. NSS and CC of SalGan, DeepGazeII, ML-Net, Shallow and Deep Networks, and the proposed Implicit Saliency, with and without Gaussian blur on the input images.

We also quantitatively evaluate both methods using Normalized Scanpath Saliency (NSS) and Correlation Coefficient (CC) and report the results in Fig. 4. Based on [4], NSS computes the average normalized saliency at fixation locations; a high positive NSS indicates high correspondence, while a negative NSS indicates anti-correspondence. CC measures the correlation between saliency maps and ground truth, with higher CC indicating better performance. Since different layers extract different semantic information, we show results from multiple convolution layers for each model. Based on Fig. 4, the proposed method outperforms the feed-forward method on average for both NSS and CC. Also, the proposed method achieves more robust performance across layers compared to the feed-forward features: the maximum performance drop across layers is larger for the feed-forward features than for the conflicting features, which shows that our proposed method achieves stable saliency detection results across layers. These results validate the usage of expectancy-mismatch in semantic layers as compared to feed-forward features in the same layers.
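
Both metrics can be computed directly from a predicted saliency map; the snippet below follows their standard formulations [4], with array names chosen for illustration.

```python
# Standard NSS and CC computations for saliency evaluation.
import numpy as np

def nss(saliency: np.ndarray, fixation_map: np.ndarray) -> float:
    """Mean of the z-scored saliency map at human fixation locations (binary map)."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixation_map.astype(bool)].mean())

def cc(saliency: np.ndarray, gt_density: np.ndarray) -> float:
    """Pearson correlation between the saliency map and a ground-truth density map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    g = (gt_density - gt_density.mean()) / (gt_density.std() + 1e-8)
    return float((s * g).mean())
```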

4.2 Implicit Saliency vs Model Saliency

In this experiment, we show that regions where the network re-examines its decision are more salient than regions that are used to make the decision. Regions that are used to make decisions in a recognition network are obtained using model saliency. In this paper, we consider two model saliency methods, Grad-CAM [24] and Guided Backpropagation (GBP) [26], and compare them against implicit saliency on the ResNet-18, -34, -50, and -101 architectures. Note that both of these model saliency methods are unsupervised. Table 1 shows that the proposed method outperforms both Grad-CAM and GBP on both the NSS and CC metrics. GBP shows the lowest results since it is designed to find only low-level features like edges. Meanwhile, Grad-CAM performs relatively well since it focuses on semantic features in the given image. However, Grad-CAM only uses the model’s decision as guidance to find the salient regions, whereas our proposed method utilizes the unexpected stimuli to extract high-level semantic features based on the expectancy-mismatch hypothesis. From this experiment we can see that while the low-level features from GBP are important for a network to make its decision, they have a relatively low correlation with human attention. Semantic features that are important for the network’s decision are correlated with human attention, and this correlation can be increased by using the proposed expectancy-mismatch.

4.3 Robustness Analysis of Implicit Saliency

In this experiment, we compare our proposed method with four state-of-the-art methods: SalGan [21], DeepGazeII [17], ML-Net [5], and Shallow and Deep Networks [22]. All these models are trained on SALICON [14], an eye-tracking dataset that offers a large number of saliency annotations on commonly used datasets like MS-COCO [19]. We visualize the qualitative results of all these methods in Fig. 3 (e), (f), (g), and (h). The results from [22], visualized in Fig. 3(h), cover a more comprehensive area, while Fig. 3(e), (f), and (g) provide higher precision. The proposed implicit saliency in Fig. 3(c) is both fine-grained and precise.

We quantitatively compare these methods against the proposed implicit saliency in Table 2. To ascertain the robustness of all the methods, we add Gaussian blur to the input images from MIT1003. When there is no noise in the input images, our proposed method is the third best for NSS and the fourth best for CC among these methods. With noise added, our proposed method’s performance drops the least: for both NSS and CC, our performance drop is lower than the average drop of the state-of-the-art algorithms and below their largest drop. Note that the comparison methods are supervised networks while the proposed method is completely unsupervised. The comparison methods are well-trained on complex scenarios similar to those in MIT1003. However, such scenarios are not common in PASCAL VOC 2007+2012 and ImageNet, which are used to train the networks in our proposed method. Also, the proposed method’s pre-trained networks are trained for tasks other than saliency detection. Despite these handicaps, the proposed implicit saliency, based on expectancy-mismatch in semantic information, provides performance comparable to the supervised networks in both qualitative and quantitative results.
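
The robustness protocol itself is straightforward to reproduce in outline: blur the inputs, recompute saliency with an otherwise unchanged pipeline, and measure how far the scores fall. In the sketch below, the kernel size and sigma are placeholders since the exact blur level is not recoverable here, `extract_saliency` is a hypothetical wrapper around the pipeline of Sec. 3, and `nss` is the metric from the earlier sketch.

```python
# Sketch of the robustness check under Gaussian blur (placeholder blur level).
import torchvision.transforms as T

def nss_drop(extract_saliency, image, fixation_map, sigma=5.0):
    """extract_saliency: any callable mapping an image tensor to a 2-D saliency map."""
    blur = T.GaussianBlur(kernel_size=11, sigma=sigma)
    clean = extract_saliency(image).detach().numpy()
    noisy = extract_saliency(blur(image)).detach().numpy()
    return nss(clean, fixation_map) - nss(noisy, fixation_map)
```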

5 Conclusion

In this paper we propose implicit saliency, which is extracted from a pre-trained network based on the expectancy-mismatch hypothesis. This network can be any classification, detection, or recognition network. Based on three experiments, we show that our proposed method has a higher correlation with human visual saliency than using feed-forward features alone, and that it is stable across layers. The proposed implicit saliency achieves comparable performance and is more robust when measured against state-of-the-art supervised saliency detection methods. Additionally, by comparing with two model saliency methods, we show that semantic saliency features driven by unexpected stimuli have a higher correlation with human visual saliency than low-level features and semantic features without unexpected stimuli. Our method is completely unsupervised, which greatly lowers the threshold for saliency detection in terms of data collection. Also, the existence of implicit saliency in neural networks can bridge the gap between the recognition and neuroscience communities: human visual saliency can be shown to be embedded in neural networks, thereby increasing the understanding of both saliency and neural networks.

References

    • [1] T. Alshawi (2018) Uncertainty estimation of visual attention models using spatiotemporal analysis. Ph.D. Thesis, Georgia Institute of Technology. Cited by: §1.
    • [2] S. I. Becker and G. Horstmann (2011) Novelty and saliency in attentional capture by unannounced motion singletons. Acta Psychologica 136 (3), pp. 290–299. Cited by: §1, §1.
    • [3] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba MIT saliency benchmark. Cited by: §4.1.
    • [4] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand (2018) What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (3), pp. 740–757. Cited by: §4.1.
    • [5] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara (2016) A deep multi-level network for saliency prediction. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3488–3493. Cited by: §1, Figure 3, §4.3.
    • [6] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara (2018) SAM: pushing the limits of saliency prediction models. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §1.
    • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.
    • [8] J. J. DiCarlo and D. D. Cox (2007) Untangling invariant object recognition. Trends in Cognitive Sciences 11 (8), pp. 333–341. Cited by: §2.
    • [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §2.
    • [10] D. Graboi and J. Lisman (2003) Recognition by top-down and bottom-up processing in cortex: the control of selective attention. Journal of Neurophysiology 90 (2), pp. 798–810. Cited by: §2.
    • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
    • [12] G. Horstmann, S. Becker, and D. Ernst (2016) Perceptual salience captures the eyes on a surprise trial. Attention, Perception, & Psychophysics 78 (7), pp. 1889–1900. Cited by: §1.
    • [13] G. Horstmann (2002) Evidence for attentional capture by a surprising color singleton in visual search. Psychological Science 13 (6), pp. 499–505. Cited by: §1.
    • [14] M. Jiang, S. Huang, J. Duan, and Q. Zhao (2015-06) SALICON: saliency in context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
    • [15] R. Krebs, W. Fias, E. Achten, and C. Boehler (2012) Stimulus conflict and stimulus novelty trigger saliency signals in locus coeruleus and anterior cingulate cortex. In Front. Hum. Neurosci. Conference Abstract: Belgian Brain Council. doi: 10.3389/conf. fnhum, Vol. 114. Cited by: §1.
    • [16] M. Kümmerer, T. S. A. Wallis, and M. Bethge (2016) DeepGaze II: reading fixations from deep features trained on object recognition. CoRR abs/1610.01563. Cited by: §1.
    • [17] M. Kümmerer, T. S. Wallis, and M. Bethge (2016) DeepGaze ii: reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563. Cited by: Figure 3, §4.3.
    • [18] G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib (2019) Distorted representation space characterization through backpropagated gradients. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 2651–2655. Cited by: §3.
    • [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.3.
    • [20] F. Murabito, C. Spampinato, S. Palazzo, D. Giordano, K. Pogorelov, and M. Riegler (2018) Top-down saliency detection driven by visual classification. Computer Vision and Image Understanding 172, pp. 67–76. Cited by: §1.
    • [21] J. Pan, C. C. Ferrer, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i-Nieto (2017) SalGAN: visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081. Cited by: Figure 3, §4.3.
    • [22] J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor (2016-06) Shallow and deep convolutional networks for saliency prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 3, §4.3.
    • [23] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
    • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.2.
    • [25] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §2.
    • [26] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §4.2.
    • [27] C. Summerfield and T. Egner (2009) Expectation (and attention) in visual cognition. Trends in Cognitive Sciences 13 (9), pp. 403–409. Cited by: §1.
    • [28] X. Sun (2018) Semantic and contrast-aware saliency. arXiv preprint arXiv:1811.03736. Cited by: §1.
    • [29] S. Thorpe, D. Fize, and C. Marlot (1996) Speed of processing in the human visual system. nature 381 (6582), pp. 520–522. Cited by: §2.
    • [30] L. Q. Uddin (2016) Salience network of the human brain. Academic press. Cited by: §1.