Grounded Video Description

by Luowei Zhou, et al.
University of Michigan

Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and the language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our novel dataset, ActivityNet-Entities, is based on the challenging ActivityNet Captions dataset and augments it with 158k bounding box annotations, each grounding a noun phrase. This makes it possible to train video description models with this data and, importantly, to evaluate how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on ActivityNet-Entities, and also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description, and demonstrate that our generated sentences are better grounded in the video.
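
To make the idea of a grounded noun phrase concrete, the following is a minimal sketch of how one such annotation might be represented and how a predicted box could be compared against it. The `GroundedPhrase` record, its field names, and the 0.5 intersection-over-union threshold are illustrative assumptions; they are not the ActivityNet-Entities file format or the evaluation protocol used in the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedPhrase:
    """Hypothetical record: one noun phrase tied to one box in one frame."""
    phrase: str
    frame_index: int
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred: GroundedPhrase, ref: GroundedPhrase,
                threshold: float = 0.5) -> bool:
    """Count a prediction as grounded if it names the same phrase, points at
    the same frame, and overlaps the reference box by at least `threshold`
    (the threshold value is an assumption of this sketch)."""
    return (
        pred.phrase == ref.phrase
        and pred.frame_index == ref.frame_index
        and iou(pred.box, ref.box) >= threshold
    )


ref = GroundedPhrase("the man", frame_index=3, box=(0.0, 0.0, 10.0, 10.0))
good = GroundedPhrase("the man", frame_index=3, box=(1.0, 0.0, 10.0, 10.0))
bad = GroundedPhrase("the man", frame_index=3, box=(5.0, 0.0, 15.0, 10.0))
print(is_grounded(good, ref), is_grounded(bad, ref))
```

Here the first prediction overlaps the reference box heavily and is accepted, while the second overlaps by only a third and is rejected. A caption whose noun phrases fail this kind of check would be plausible-sounding but not supported by the visual evidence.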



Learning to Relate from Captions and Bounding Boxes

In this work, we propose a novel approach that predicts the relationship...

Grounded Situation Recognition

We introduce Grounded Situation Recognition (GSR), a task that requires ...

A Dataset and Application for Facial Recognition of Individual Gorillas in Zoo Environments

We put forward a video dataset with 5k+ facial bounding box annotations ...

Grounded Objects and Interactions for Video Captioning

We address the problem of video captioning by grounding language generat...

OptiBox: Breaking the Limits of Proposals for Visual Grounding

The problem of language grounding has attracted much attention in recent...

Resolving References to Objects in Photographs using the Words-As-Classifiers Model

A common use of language is to refer to visually present objects. Modell...

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Small satellite constellations provide daily global coverage of the eart...

Code Repositories


Video Grounding and Captioning


A Dataset for Grounded Video Description
