Connecting Vision and Language with Video Localized Narratives

02/22/2023
by Paul Voigtlaender, et al.

We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question-answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/.
