Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

03/11/2023
by Teng Wang, et al.

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which requires human-annotated event boundaries during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and the corresponding event segments. Instead of coarse-grained video-language alignment, we present two dual pretext tasks that encourage fine-grained segment-level alignment: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To obtain accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive results on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge.
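The abstract describes TEG as grounding event proposals by estimating cross-modal distance in a joint semantic space, combined with a semantic-aware cost for assigning events to sentences. The sketch below is only a rough illustration of how such a matching cost could be assembled; the embedding model, the L1 span cost, and the weights `w_sem` and `w_loc` are assumptions for demonstration, not the paper's actual implementation.

```python
# Minimal sketch of a semantic-aware event-to-sentence matching cost.
# All hyperparameters and cost terms here are illustrative assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def semantic_aware_matching(event_feats, text_feats, event_spans, gt_spans,
                            w_sem=1.0, w_loc=1.0):
    """Match N event proposals to M ground-truth sentences.

    event_feats: (N, D) proposal embeddings in the joint space
    text_feats:  (M, D) sentence embeddings in the joint space
    event_spans: (N, 2) predicted (start, end), normalized to [0, 1]
    gt_spans:    (M, 2) annotated (start, end), normalized to [0, 1]
    """
    # Cross-modal distance in the joint semantic space (TEG-style):
    # cosine distance between L2-normalized embeddings.
    ev = F.normalize(event_feats, dim=-1)
    tx = F.normalize(text_feats, dim=-1)
    sem_cost = 1.0 - ev @ tx.T                          # (N, M)

    # Localization cost: L1 distance between predicted and annotated spans.
    loc_cost = torch.cdist(event_spans, gt_spans, p=1)  # (N, M)

    # Blending a semantic term into the assignment cost lets a proposal that
    # is semantically close to a sentence still be matched despite boundary
    # disagreement, mitigating ambiguous boundary annotations.
    cost = w_sem * sem_cost + w_loc * loc_cost

    # Optimal bipartite assignment (Hungarian algorithm).
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return row, col  # matched (event_idx, sentence_idx) pairs
```

In this sketch the matched pairs would then feed the two pretext tasks: the semantic distance of matched pairs drives TEG, while the matched proposals serve as conditioning for the ETG captioning objective.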

Related research

02/28/2018 · Joint Event Detection and Description in Continuous Video Streams
As a fine-grained video understanding task, dense video captioning invol...

05/22/2023 · GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language
One of the essential human skills is the ability to seamlessly build an ...

04/08/2019 · Streamlined Dense Video Captioning
Dense video captioning is an extremely challenging task since accurate a...

04/23/2018 · Jointly Localizing and Describing Events for Dense Video Captioning
Automatically describing a video with natural language is regarded as a ...

05/18/2021 · Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching
This paper proposes an approach to Dense Video Captioning (DVC) without ...

04/01/2022 · Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding
Cognitive science has shown that humans perceive videos in terms of even...

01/27/2023 · Semi-Parametric Video-Grounded Text Generation
Efficient video-language modeling should consider the computational cost...
