Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

05/18/2021
by   Bofeng Wu, et al.

This paper proposes an approach to Dense Video Captioning (DVC) that requires no pairwise event-sentence annotation. First, we adopt knowledge distilled from relevant, well-solved tasks to generate high-quality event proposals. Then we incorporate the contrastive loss and cycle-consistency loss typically applied to cross-modal retrieval tasks to build semantic matching between proposals and sentences, which is eventually used to train the caption generation module. In addition, the parameters of the matching module are initialized via pre-training on annotated images to improve matching performance. Extensive experiments on the ActivityNet Captions dataset reveal the significance of distillation-based event proposal generation and cross-modal retrieval-based semantic matching to weakly supervised DVC, and demonstrate the superiority of our method over existing state-of-the-art methods.
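To illustrate the two matching losses the abstract mentions, below is a minimal NumPy sketch of an InfoNCE-style contrastive loss and a simple cycle-consistency proxy over proposal and sentence embeddings. The embedding shapes, temperature value, and the squared-index-offset cycle penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def contrastive_matching_loss(props, sents, temperature=0.1):
    """InfoNCE-style contrastive loss between event proposals and sentences.

    props, sents: (N, D) L2-normalized embeddings; row i of each array is
    assumed to be a putative matching pair (a sketch, not the paper's code).
    """
    sim = props @ sents.T / temperature                    # (N, N) similarities
    # Row-wise log-softmax: proposal i should prefer sentence i.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def cycle_consistency_loss(props, sents):
    """Cycle consistency: proposal i -> nearest sentence -> nearest proposal
    should return to i; the squared index offset is a simple proxy penalty."""
    p2s = np.argmax(props @ sents.T, axis=1)               # proposal -> sentence
    s2p = np.argmax(sents @ props.T, axis=1)               # sentence -> proposal
    back = s2p[p2s]                                        # round trip
    return np.mean((back - np.arange(len(props))) ** 2)

# Demo with perfectly aligned pairs (same embedding for both modalities):
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(cycle_consistency_loss(emb, emb))                    # 0.0 when aligned
print(contrastive_matching_loss(emb, emb))                 # small positive value
```

With aligned pairs the round trip is the identity, so the cycle loss vanishes; in training, minimizing both terms jointly pulls each proposal toward its matching sentence while keeping the retrieval mapping invertible.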

