Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

03/22/2023
by Sixun Dong, et al.

Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode both the text describing each action and the text describing the whole video. To model the correspondence between text and video, we propose a multiple granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that the actions in a video occur sequentially in time to generate pseudo frame-sentence correspondences, and we supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR
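
To make the two-level loss described above more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation (see the repository linked above for that). All function names, the temperature value, the loss weight, and especially the uniform order-preserving split used in `pseudo_labels` are assumptions made for illustration; the abstract only states that pseudo labels are derived from the sequential order of actions, not how.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(queries, keys, targets, temperature=0.07):
    # InfoNCE-style loss: query i should match keys[targets[i]] against all other keys.
    # The temperature 0.07 is an assumed value, not taken from the paper.
    total = 0.0
    for q, t in zip(queries, targets):
        logits = [cosine(q, k) / temperature for k in keys]
        total += cross_entropy(logits, t)
    return total / len(queries)

def pseudo_labels(num_frames, num_sentences):
    # ASSUMPTION: frames are split uniformly, in temporal order, among the sentences.
    # This only illustrates the "actions happen sequentially" prior.
    return [i * num_sentences // num_frames for i in range(num_frames)]

def multi_granularity_loss(video_embs, paragraph_embs, frame_embs, sentence_embs, weight=1.0):
    # Coarse level: video i is paired with paragraph i (symmetric, CLIP-style).
    idx = list(range(len(video_embs)))
    coarse = 0.5 * (contrastive_loss(video_embs, paragraph_embs, idx)
                    + contrastive_loss(paragraph_embs, video_embs, idx))
    # Fine level: within each video, frames are matched to sentences via pseudo labels.
    fine = 0.0
    for frames, sentences in zip(frame_embs, sentence_embs):
        labels = pseudo_labels(len(frames), len(sentences))
        fine += contrastive_loss(frames, sentences, labels)
    fine /= len(frame_embs)
    return coarse + weight * fine  # `weight` is a hypothetical balancing factor
```

For example, `pseudo_labels(6, 3)` assigns the first two frames to the first sentence, the next two to the second, and the last two to the third. The labels are always non-decreasing in time, which is the only property this sketch takes from the abstract.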


research
12/06/2022

Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations

Previous work on action representation learning focused on global repres...
research
03/28/2022

Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Prior works on action representation learning mainly focus on designing ...
research
02/08/2023

Weakly-supervised Representation Learning for Video Alignment and Analysis

Many tasks in video analysis and understanding boil down to the need for...
research
06/21/2022

Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

The leverage of large volumes of web videos paired with the searched que...
research
10/21/2021

Video and Text Matching with Conditioned Embeddings

We present a method for matching a text sentence from a given corpus to ...
research
11/18/2022

Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Video summarization aims to select the most informative subset of frames...
research
04/05/2017

Weakly Supervised Dense Video Captioning

This paper focuses on a novel and challenging vision task, dense video c...
