Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

11/10/2022
by Zhecan Wang, et al.

Visual commonsense understanding requires Vision-Language (VL) models not only to understand the image and the text but also to cross-reference between the two, fully integrating them to comprehend the visual scene being described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, because evaluation data resources are limited, it remains unclear whether these models truly understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test a model's understanding of the visual scene, the text, and related knowledge. We then go a step further and show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not vice versa; (2) visual information is generally under-utilized compared with text.
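
To make the idea of automatic question-answer generation concrete, below is a minimal illustrative sketch, not the authors' ME pipeline: the SceneAnnotation structure, the relation triples, and the question template are assumptions made for this example only.

    # Hypothetical sketch of template-based QA generation from scene annotations.
    # NOT the authors' ME pipeline; the annotation format and templates are assumed.
    from dataclasses import dataclass
    from typing import Dict, List, Tuple
    import random

    @dataclass
    class SceneAnnotation:
        objects: List[str]                     # e.g. ["person", "dog", "frisbee"]
        relations: List[Tuple[str, str, str]]  # e.g. [("person", "throws", "frisbee")]

    def generate_qa_pairs(ann: SceneAnnotation, num_distractors: int = 3) -> List[Dict]:
        """Turn object/relation annotations into multiple-choice QA probes."""
        qa_pairs = []
        vocab = sorted(set(ann.objects))
        for subj, pred, obj in ann.relations:
            question = f"What does the {subj} {pred}?"
            # Use other annotated objects as distractor answer choices.
            distractors = [o for o in vocab if o != obj][:num_distractors]
            choices = distractors + [obj]
            random.shuffle(choices)
            qa_pairs.append({"question": question, "choices": choices, "answer": obj})
        return qa_pairs

    # Toy usage: one relation yields one multiple-choice probe.
    ann = SceneAnnotation(objects=["person", "dog", "frisbee"],
                          relations=[("person", "throws", "frisbee")])
    print(generate_qa_pairs(ann))

Probes of this kind target semantically low-level grounding (objects and relations); the paper's point is that such generated data can also be used to analyze and improve performance on higher-level commonsense reasoning.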

Related research

05/14/2022
What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge
There are limitations in learning language from text alone. Therefore, r...

12/16/2021
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Answering complex questions about images is an ambitious goal for machin...

08/06/2021
Interpretable Visual Understanding with Cognitive Attention Network
While image understanding on recognition-level has achieved remarkable a...

09/14/2021
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Commonsense is defined as the knowledge that is shared by everyone. Howe...

04/17/2022
Attention Mechanism based Cognition-level Scene Understanding
Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...

08/07/2023
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Recent advancements in Large Vision-Language Models (LVLMs) have demonst...

05/20/2023
Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination
In this work, we investigate a more realistic unsupervised multimodal ma...
