HL Dataset: Grounding High-Level Linguistic Concepts in Vision

02/23/2023
by   Michele Cafagna, et al.

Current captioning datasets focus on object-centric captions that describe the visible objects in the image, often stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize visual content, they fall short of expressing abstract concepts that are trivial for humans, e.g. "people having a picnic". Such concepts are grounded in humans' personal experience and contribute to forming common-sense assumptions. We present the High-Level (HL) Dataset: a dataset extending 14,997 images of the COCO dataset with 134,973 human-annotated (high-level) abstract captions collected along three axes: scenes, actions, and rationales. We describe and release the dataset and show how it can be used to assess models' multimodal grounding of abstract concepts and to enrich models' visio-linguistic representations. Moreover, we describe potential tasks enabled by this dataset involving interactions between high- and low-level concepts.
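Since each image is annotated along the three axes named above (scenes, actions, rationales), one HL record can be pictured as a COCO image id paired with caption lists per axis. A minimal sketch of such a record follows; the class name, field names, and example captions are illustrative assumptions, not the dataset's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HLAnnotation:
    """Hypothetical record for one HL Dataset entry: a COCO image paired
    with high-level captions along the paper's three axes."""
    coco_image_id: int
    scenes: List[str] = field(default_factory=list)      # e.g. "a park"
    actions: List[str] = field(default_factory=list)     # e.g. "having a picnic"
    rationales: List[str] = field(default_factory=list)  # why the action happens

    def caption_count(self) -> int:
        """Total number of high-level captions attached to this image."""
        return len(self.scenes) + len(self.actions) + len(self.rationales)

# Illustrative example (made-up values, not taken from the dataset):
ann = HLAnnotation(
    coco_image_id=42,
    scenes=["a park on a sunny day"],
    actions=["people having a picnic"],
    rationales=["they want to relax outdoors"],
)
print(ann.caption_count())  # 3
```

A structure like this makes it easy to pair each high-level caption with the original low-level COCO captions for the same image id when studying high-/low-level concept interactions.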


