Grounded Video Description

12/17/2018
by Luowei Zhou, et al.

Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and the language side. As a result, models typically sidestep the difficulty of recognition and generate plausible sentences that rest on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our novel dataset, ActivityNet-Entities, is based on the challenging ActivityNet Captions dataset and augments it with 158k bounding box annotations, each grounding a noun phrase. This makes it possible to train video description models on this data and, importantly, to evaluate how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model that is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on ActivityNet-Entities, and also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description, and demonstrate that our generated sentences are better grounded in the video.
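
To make the annotation idea concrete, the following is a minimal sketch of how a noun phrase might be tied to a bounding box in one frame, and how a predicted box could be checked against it. The record layout (`GroundedPhrase` and its fields), the helper names, and the 0.5 intersection-over-union threshold are all illustrative assumptions; the abstract does not specify the dataset schema or the evaluation protocol.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GroundedPhrase:
    """Hypothetical record: one noun phrase linked to one box in one frame."""
    phrase: str
    frame_index: int
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounded(predicted: Box, annotation: GroundedPhrase,
                threshold: float = 0.5) -> bool:
    """Assumed criterion: a prediction counts as grounded when its IoU
    with the annotated box reaches the threshold."""
    return iou(predicted, annotation.box) >= threshold


# Made-up example values, not taken from the dataset.
example = GroundedPhrase(phrase="a man", frame_index=3,
                         box=(10.0, 20.0, 110.0, 220.0))
```

Under these assumptions, `is_grounded((12.0, 25.0, 108.0, 215.0), example)` compares a model's predicted region for "a man" against the annotated box in frame 3.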


research
12/01/2019

Learning to Relate from Captions and Bounding Boxes

In this work, we propose a novel approach that predicts the relationship...
research
03/26/2020

Grounded Situation Recognition

We introduce Grounded Situation Recognition (GSR), a task that requires ...
research
12/08/2020

A Dataset and Application for Facial Recognition of Individual Gorillas in Zoo Environments

We put forward a video dataset with 5k+ facial bounding box annotations ...
research
11/16/2017

Grounded Objects and Interactions for Video Captioning

We address the problem of video captioning by grounding language generat...
research
11/29/2019

OptiBox: Breaking the Limits of Proposals for Visual Grounding

The problem of language grounding has attracted much attention in recent...
research
10/07/2015

Resolving References to Objects in Photographs using the Words-As-Classifiers Model

A common use of language is to refer to visually present objects. Modell...
research
01/03/2020

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Small satellite constellations provide daily global coverage of the eart...
