Detecting and Preventing Hallucinations in Large Vision Language Models

08/11/2023
by Anisha Gunjal, et al.

Instruction-tuned Large Vision Language Models (LVLMs) have made significant advances in generalizing across a diverse set of multimodal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains a challenging task for these models. We find that even the current state-of-the-art LVLM (InstructBLIP) still generates a staggering 30 percent hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained labels on VQA examples, making it the first comprehensive multimodal hallucination detection dataset for detailed image descriptions. Unlike previous work that considers only object hallucination, we additionally annotate entity descriptions and relationships that are unfaithful. To demonstrate the potential of this dataset for preference alignment, we propose fine-grained Direct Preference Optimization (DPO), and we also train fine-grained multimodal reward models and evaluate their effectiveness with best-of-n rejection sampling. We perform human evaluation on both DPO and rejection sampling, and find that both significantly reduce hallucination rates over the baseline, with DPO achieving a 41% reduction.
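
The abstract pairs fine-grained reward models with best-of-n rejection sampling at decoding time. The sketch below is a rough, hypothetical illustration of that setup, not the authors' implementation: the names `best_of_n_rejection_sampling` and `fine_grained_reward`, the sentence-level segmentation, and the per-segment averaging heuristic are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of best-of-n rejection
# sampling driven by a fine-grained reward model.
import random
from typing import Callable, List


def best_of_n_rejection_sampling(
    prompt: str,
    generate: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Draw n candidate responses and keep the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]


def fine_grained_reward(prompt: str, response: str) -> float:
    """Toy stand-in for a fine-grained reward model.

    Scores each sentence-level segment separately and averages, so a single
    hallucinated span drags down the whole response. A trained reward model
    would replace the placeholder per-segment heuristic with classifier output.
    """
    segments = [s.strip() for s in response.split(".") if s.strip()]
    if not segments:
        return 0.0
    # Placeholder rule standing in for per-segment hallucination scores.
    segment_scores = [0.0 if "unicorn" in s.lower() else 1.0 for s in segments]
    return sum(segment_scores) / len(segment_scores)


if __name__ == "__main__":
    # Dummy generator standing in for an LVLM decoding pass.
    pool = [
        "A dog sits on a red couch. A unicorn floats above it.",
        "A dog sits on a red couch.",
        "Two people talk near a window.",
    ]
    best = best_of_n_rejection_sampling(
        "Describe the image in detail.",
        generate=lambda p: random.choice(pool),
        reward_fn=fine_grained_reward,
        n=4,
    )
    print(best)
```

The contrast this sketch highlights: rejection sampling reduces hallucination purely by re-ranking samples at inference time, while DPO-style training changes the generator itself.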

Related research

06/15/2023
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
We propose Encyclopedic-VQA, a large scale visual question answering (VQ...

08/18/2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
With the advances in large scale vision-and-language models (VLMs) it is...

08/04/2017
Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Visual question answering (VQA) is challenging because it requires a sim...

04/12/2020
YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benc...

06/01/2023
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data
The impressive advances and applications of large language and joint lan...

12/06/2019
Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
The large adoption of the self-attention (i.e. transformer model) and BE...

06/02/2023
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Language models (LMs) often exhibit undesirable text generation behavior...
