Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data

06/01/2023
by Nathan Vaska, et al.

The impressive advances and applications of large language and joint language-and-visual understanding models have led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gathering naturally occurring data for complex multi-modal reasoning tasks bottlenecks the evaluation of AI methods on tasks not already covered by an academic dataset. In this work, we leverage recent advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks. We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset for a challenging task that is not well covered by existing datasets. We benchmark a state-of-the-art visual question answering (VQA) model against data generated with this method and demonstrate that, while the task is tractable, the model performs significantly worse on context-dependent anomaly detection than on standard VQA tasks.
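The abstract's data-generation framework can be illustrated with a minimal sketch. The paper's actual scene categories, prompt templates, and models are not specified here, so the context/object pairings, prompt wording, and function names below are all hypothetical; the key idea is that the same object is labeled normal in one context and anomalous in another, with each prompt intended for a text-to-image model and each question for a VQA model.

```python
# Hypothetical context/object pairings -- the paper's actual categories
# and prompt templates are not given in the abstract.
CONTEXTS = {
    "a sunny beach": {"normal": "a beach umbrella", "anomalous": "a snowman"},
    "a snowy mountain": {"normal": "a snowman", "anomalous": "a beach umbrella"},
}


def build_eval_set(contexts):
    """Build (image_prompt, question, answer) triples for a
    context-dependent anomaly task: the same object counts as
    normal in one scene and anomalous in another."""
    samples = []
    for scene, objects in contexts.items():
        for label, obj in objects.items():
            samples.append({
                # prompt would be fed to a text-to-image model
                "prompt": f"A photo of {obj} on {scene}",
                # question would be posed to the VQA model under evaluation
                "question": f"Is it unusual to see {obj} in this scene?",
                # ground-truth answer follows from the pairing, not the image
                "answer": "yes" if label == "anomalous" else "no",
            })
    return samples


samples = build_eval_set(CONTEXTS)
```

Because labels are assigned by construction rather than by human annotation, the framework scales to tasks where naturally occurring labeled data is scarce; the VQA model's accuracy on these triples is then compared against its accuracy on standard VQA benchmarks.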


