Grounded Video Description

12/17/2018
by Luowei Zhou, et al.
University of Michigan
Facebook

Video description is one of the most challenging problems in vision and language understanding because of the large variability on both the video and the language side. As a result, models typically sidestep the difficulty of recognition and generate plausible sentences that rely on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of the video. Our novel dataset, ActivityNet-Entities, is based on the challenging ActivityNet Captions dataset and augments it with 158k bounding box annotations, each grounding a noun phrase. This makes it possible to train video description models with this data and, importantly, to evaluate how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model that is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on ActivityNet-Entities, and also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description, and demonstrate that our generated sentences are better grounded in the video.
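
As a rough illustration of what checking "how grounded" a caption is could look like, the sketch below scores predicted boxes for noun phrases against annotated boxes using intersection over union (IoU). It is not the paper's evaluation protocol: the `Box` class, the `grounding_accuracy` function, the phrase-keyed dictionaries, and the 0.5 IoU threshold are all assumptions made for this example, and it ignores which video frame a box belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp to zero so that an "inverted" box (no overlap) has no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2))
    inter_area = inter.area()
    union = a.area() + b.area() - inter_area
    return inter_area / union if union > 0 else 0.0


def grounding_accuracy(predicted: dict, annotated: dict,
                       threshold: float = 0.5) -> float:
    """Fraction of annotated noun phrases whose predicted box has
    IoU >= threshold with the annotated box.

    Both arguments map a noun phrase (str) to a Box. The 0.5 default
    is an assumed, commonly used threshold, not taken from the source.
    """
    if not annotated:
        return 0.0
    hits = sum(
        1
        for phrase, gt_box in annotated.items()
        if phrase in predicted and iou(predicted[phrase], gt_box) >= threshold
    )
    return hits / len(annotated)


if __name__ == "__main__":
    annotated = {"a man": Box(0, 0, 10, 10), "a dog": Box(20, 20, 30, 30)}
    predicted = {"a man": Box(1, 1, 10, 10), "a dog": Box(0, 0, 5, 5)}
    print(grounding_accuracy(predicted, annotated))
```

In this toy input the box for "a man" overlaps its annotation well while the box for "a dog" does not overlap at all, so only one of the two phrases counts as grounded.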


12/01/2019

Learning to Relate from Captions and Bounding Boxes

In this work, we propose a novel approach that predicts the relationship...
03/26/2020

Grounded Situation Recognition

We introduce Grounded Situation Recognition (GSR), a task that requires ...
12/08/2020

A Dataset and Application for Facial Recognition of Individual Gorillas in Zoo Environments

We put forward a video dataset with 5k+ facial bounding box annotations ...
11/16/2017

Grounded Objects and Interactions for Video Captioning

We address the problem of video captioning by grounding language generat...
11/29/2019

OptiBox: Breaking the Limits of Proposals for Visual Grounding

The problem of language grounding has attracted much attention in recent...
10/07/2015

Resolving References to Objects in Photographs using the Words-As-Classifiers Model

A common use of language is to refer to visually present objects. Modell...
01/03/2020

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Small satellite constellations provide daily global coverage of the eart...

Code Repositories

grounded-video-description

Video Grounding and Captioning



ActivityNet-Entities

A Dataset for Grounded Video Description


