Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than associating language with visual concepts. In this work, we propose a generic framework, which we call Human Importance-aware Network Tuning (HINT), that effectively leverages human supervision to improve visual grounding. HINT constrains deep networks to be sensitive to the same input regions as humans. Crucially, our approach optimizes the alignment between human attention maps and gradient-based network importances, ensuring that models learn not just to look at, but to rely on, the visual concepts that humans found relevant for a task when making predictions. We demonstrate our approach on Visual Question Answering and Image Captioning, achieving state-of-the-art results on the VQA-CP dataset, which penalizes over-reliance on language priors.
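The alignment idea described above can be sketched as a pairwise ranking objective between two per-region importance vectors: one derived from human attention maps and one from gradients of the network's output with respect to its input regions. The function below is an illustrative simplification, not the paper's exact loss; the name `hint_ranking_loss` and the hinge formulation are assumptions made for this sketch.

```python
import numpy as np

def hint_ranking_loss(net_importance, human_importance):
    """Sketch of a HINT-style alignment loss (illustrative, not the paper's exact form).

    net_importance:   (num_regions,) gradient-based importance per image region.
    human_importance: (num_regions,) human attention score per image region.

    For every pair of regions (i, j) that humans rank i above j, add a
    hinge penalty when the network scores region j at least as high as i,
    so the network's importance ordering is pushed toward the human one.
    """
    net_importance = np.asarray(net_importance, dtype=float)
    human_importance = np.asarray(human_importance, dtype=float)
    # Pairwise differences: diff[i, j] = importance[i] - importance[j]
    net_diff = net_importance[:, None] - net_importance[None, :]
    human_prefers = (human_importance[:, None] - human_importance[None, :]) > 0
    # Penalize only the pairs whose network ordering contradicts the human ordering.
    return float(np.sum(np.maximum(-net_diff, 0.0) * human_prefers))
```

When the network's ranking of regions already matches the human ranking, every penalized pair has a positive margin and the loss is zero; reversing the ranking makes the loss strictly positive, which is the signal used to tune the network.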



Related research

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? (06/17/2016)
We conduct large-scale studies on `human attention' in Visual Question A...

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models (08/18/2023)
With the advances in large scale vision-and-language models (VLMs) it is...

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations (06/30/2022)
We propose a margin-based loss for vision-language model pretraining tha...

3D Concept Grounding on Neural Fields (07/13/2022)
In this paper, we address the challenging problem of 3D concept groundin...

Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks (08/18/2020)
Attention models are widely used in Vision-language (V-L) tasks to perfo...

Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators (09/22/2019)
Grounding language to visual relations is critical to various language-a...

3D-LLM: Injecting the 3D World into Large Language Models (07/24/2023)
Large language models (LLMs) and Vision-Language Models (VLMs) have been...
