Rectify ViT Shortcut Learning by Visual Saliency

06/17/2022
by Chong Ma, et al.

Shortcut learning is common in deep learning models and harmful to them, leading to degenerated feature representations and consequently jeopardizing a model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts, which are predominantly related to background factors. For example, in the medical imaging field, eye-gaze data from radiologists is effective human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive, and sometimes not even practical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model is adopted to predict saliency maps for input image samples. The saliency maps are then used to distill the most informative image patches. In the proposed SGT, self-attention among image patches focuses only on the distilled informative ones. Considering that this distillation operation may lead to the loss of global information, we further introduce, in the last encoder layer, a residual connection that captures self-attention across all image patches. Experimental results on four independent public datasets show that our SGT framework effectively learns and leverages human prior knowledge without eye-gaze data and achieves much better performance than baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring visual saliency derived from human prior knowledge to rectify shortcut learning.
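The mechanism described above, restricting self-attention to saliency-selected patches and then restoring global context in the final encoder layer, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the saliency maps are assumed to be precomputed by an external saliency model, and the names (saliency_patch_scores, SaliencyGuidedEncoder, keep_ratio) and layer counts are hypothetical choices made for the example.

# Minimal sketch (not the authors' code): saliency-guided patch selection
# for a ViT-style encoder, assuming a precomputed saliency map per image.
import torch
import torch.nn as nn
import torch.nn.functional as F


def saliency_patch_scores(saliency_map, patch_size=16):
    """Average saliency inside each non-overlapping patch.
    saliency_map: (B, H, W) tensor with values in [0, 1]."""
    pooled = F.avg_pool2d(saliency_map.unsqueeze(1),
                          kernel_size=patch_size, stride=patch_size)
    return pooled.flatten(1)  # (B, num_patches)


class SaliencyGuidedEncoder(nn.Module):
    """Early layers attend only to the top-k salient patch tokens; the last
    layer attends over all patches, a rough stand-in for the residual
    connection that restores global information."""

    def __init__(self, dim=192, depth=4, heads=3, keep_ratio=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.local_layers = nn.TransformerEncoder(layer, depth - 1)
        self.global_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens, patch_scores):
        # tokens: (B, N, dim); patch_scores: (B, N) saliency per patch
        k = max(1, int(self.keep_ratio * tokens.size(1)))
        idx = patch_scores.topk(k, dim=1).indices               # most salient patches
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = torch.gather(tokens, 1, idx_exp)             # (B, k, dim)
        refined = self.local_layers(selected)                   # attention on salient patches only
        tokens = tokens.scatter(1, idx_exp, refined)            # write refined tokens back
        return self.global_layer(tokens)                        # full self-attention in last layer


# Toy usage with random data: 224x224 saliency map, 16x16 patches -> 196 tokens.
sal = torch.rand(2, 224, 224)
scores = saliency_patch_scores(sal)             # (2, 196)
tokens = torch.randn(2, 196, 192)
out = SaliencyGuidedEncoder()(tokens, scores)   # (2, 196, 192)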


