Who's Waldo? Linking People Across Text and Images

08/16/2021
by Claire Yuqing Cui, et al.

We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
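The name-masking step described above can be illustrated with a minimal sketch. It assumes the person names in each caption are already known (for example, from annotations), and the `[NAME]` placeholder and the `mask_names` helper are hypothetical choices for illustration, not necessarily what the authors' pipeline uses.

```python
import re


def mask_names(caption, names, token="[NAME]"):
    """Replace each listed person name in the caption with a placeholder.

    Returns the masked caption and the matched names in order of
    appearance, so the i-th placeholder corresponds to mentions[i].
    The name list and the token are assumptions for illustration.
    """
    if not names:
        return caption, []
    # Try longer names first so "Ann Lee" is matched before "Ann".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    mentions = []

    def _replace(match):
        # Record which name was hidden behind this placeholder.
        mentions.append(match.group(0))
        return token

    masked = pattern.sub(_replace, caption)
    return masked, mentions


caption = "Ann Lee shakes hands with Bob Ray while Ann Lee smiles."
masked, mentions = mask_names(caption, ["Ann Lee", "Bob Ray"])
print(masked)    # [NAME] shakes hands with [NAME] while [NAME] smiles.
print(mentions)  # ['Ann Lee', 'Bob Ray', 'Ann Lee']
```

With the names hidden, a model has to rely on the remaining context ("shakes hands with", "smiles") to decide which pictured person each placeholder refers to, rather than on a learned link between a name and an appearance.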


