Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

09/14/2021
by   Da Yin, et al.

Commonsense is usually defined as knowledge shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs shaped by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark whose images come primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than on the Western region. We analyze the reasons behind this performance disparity and find that the gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.


