Weakly Supervised Dense Video Captioning

04/05/2017
by Zhiqiang Shen et al.

This paper addresses a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit annotation of fine-grained sentence-to-region-sequence correspondence; it relies only on weak video-level sentence annotations. It differs from existing video captioning systems in three technical aspects. First, we propose a lexical fully convolutional network (Lexical-FCN) trained with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels. Second, we introduce a novel submodular maximization scheme that generates multiple informative and diverse region-sequences from the Lexical-FCN outputs; a winner-takes-all scheme is adopted during training to weakly associate sentences with region-sequences. Third, a sequence-to-sequence language model is trained on the weakly supervised associations obtained in this way. We show that the proposed method not only produces informative and diverse dense captions, but also outperforms state-of-the-art single video captioning methods by a large margin.
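The submodular maximization step can be pictured as a greedy selection: starting from an empty set, repeatedly add the candidate region-sequence with the largest marginal gain of an objective that trades informativeness against redundancy with already-selected sequences. The sketch below is a minimal illustration of that greedy pattern; the scoring functions (`informativeness`, `similarity`) and the penalty weight `lam` are illustrative assumptions, not the paper's exact Lexical-FCN-based formulation.

```python
# Greedy selection of diverse, informative candidates.
# The concrete scores here are illustrative stand-ins for the
# paper's Lexical-FCN-derived terms.

def greedy_select(candidates, informativeness, similarity, k, lam=0.5):
    """Pick up to k items by greedy marginal gain.

    candidates: list of item ids
    informativeness: dict id -> score (higher = more informative)
    similarity: dict (id, id) -> pairwise similarity in [0, 1]
    lam: weight of the redundancy penalty (assumed hyperparameter)
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        def gain(c):
            # marginal gain: own score minus similarity to the chosen set
            redundancy = max(
                (similarity.get((c, s), similarity.get((s, c), 0.0))
                 for s in selected),
                default=0.0,
            )
            return informativeness[c] - lam * redundancy
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # stop once no candidate adds positive value
        selected.append(best)
        remaining.remove(best)
    return selected

# toy usage: "b" is nearly a duplicate of "a", so the greedy step
# prefers the distinct but lower-scoring "c" for the second slot
cands = ["a", "b", "c"]
info = {"a": 1.0, "b": 0.9, "c": 0.5}
sim = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.1}
print(greedy_select(cands, info, sim, k=2))  # -> ['a', 'c']
```

The diminishing-returns structure is what makes greedy selection attractive here: each added sequence can only reduce (never increase) the marginal value of the remaining candidates, so a simple pass yields a diverse set without searching all subsets.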



Related research:

- 12/10/2018: Weakly Supervised Dense Event Captioning in Videos
- 02/27/2023: Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- 09/22/2019: Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
- 03/22/2023: Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
- 06/21/2013: Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences
- 08/07/2017: PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
- 05/04/2023: Weakly-supervised Micro- and Macro-expression Spotting Based on Multi-level Consistency
