Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models

05/31/2023
by Jiarui Zhang, et al.

Visual Question Answering is a challenging task, as it requires seamless interaction between perceptual, linguistic, and background-knowledge systems. While recent progress in vision-and-language models such as BLIP has improved performance on this task, we lack a clear understanding of how well such models perform on different kinds of questions and reasoning types. Since our initial analysis of BLIP-family models revealed difficulty with answering fine-detail questions, we investigate the following question: can visual cropping be employed to improve the performance of state-of-the-art visual question answering models on fine-detail questions? Given the recent success of the BLIP-family models, we study a zero-shot and a fine-tuned BLIP model. We define three controlled subsets of the popular VQA-v2 benchmark to measure whether cropping can help model performance. Besides human cropping, we devise two automatic cropping strategies: one based on CLIP multi-modal embeddings and one based on the gradients of the BLIP visual QA model. Our experiments demonstrate that the performance of BLIP model variants can be significantly improved through human cropping, and that the automatic cropping methods yield comparable benefits. A deeper dive into our findings indicates that the performance enhancement is more pronounced in zero-shot models than in fine-tuned models, and more salient with smaller bounding boxes than with larger ones. We perform case studies to connect quantitative differences with qualitative observations across question types and datasets. Finally, we see that the cropping enhancement is robust, as we gain an improvement of 4.59 simply by inputting a concatenation of the original and gradient-based cropped images. We make our code available to facilitate further innovation on visual cropping methods for question answering.
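To make the CLIP-based cropping strategy concrete, here is a minimal sketch of question-guided cropping: candidate crops are scored by CLIP image-text similarity against the question, and the best-matching crop is passed on to the VQA model. This is an illustration under assumptions, not the paper's exact procedure; the sliding-window scales, stride, model checkpoint, and the helper name clip_guided_crop are all choices made here for the example.

```python
# Sketch: CLIP-guided visual cropping (assumed sliding-window search, not the paper's exact method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_guided_crop(image: Image.Image, question: str,
                     scales=(0.3, 0.5, 0.7), stride_frac=0.25):
    """Return the crop whose CLIP image embedding best matches the question text."""
    width, height = image.size
    candidates = [image]  # keep the full image as a fallback candidate
    for scale in scales:
        win_w, win_h = int(width * scale), int(height * scale)
        step_w = max(1, int(win_w * stride_frac))
        step_h = max(1, int(win_h * stride_frac))
        for top in range(0, height - win_h + 1, step_h):
            for left in range(0, width - win_w + 1, step_w):
                candidates.append(image.crop((left, top, left + win_w, top + win_h)))

    # Score every candidate crop against the question with CLIP.
    inputs = processor(text=[question], images=candidates,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (num_crops, 1)
    best = int(logits.squeeze(1).argmax())
    return candidates[best]

# Usage: feed the returned crop (or, as in the robustness experiment, a concatenation
# of the original and cropped images) to the BLIP VQA model instead of the raw image.
# crop = clip_guided_crop(Image.open("example.jpg"), "What color is the bird's beak?")
```

The gradient-based strategy follows the same pattern but locates the crop from the BLIP VQA model's input gradients rather than from CLIP similarity.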


