Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

07/28/2020
by Shaoxiang Chen, et al.

Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and video. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent them, but interactions between modalities are rarely explored. Inspired by the cross-modal interactions that exist in the human brain, we propose a novel method for learning pairwise modality interactions that better exploits the complementary information in each pair of modalities in a video and thus improves performance on both tasks. We model modality interaction at both the sequence level and the channel level in a pairwise fashion, and the pairwise interaction also provides some explainability for the predictions on the target tasks. We demonstrate the effectiveness of our method and validate specific design choices through extensive ablation studies. Our method achieves state-of-the-art performance on four standard benchmark datasets: MSVD and MSR-VTT (event captioning task), and Charades-STA and ActivityNet Captions (temporal sentence localization task).
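
To make the idea of pairwise interaction at the sequence and channel levels more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the dot-product affinity, the mean pooling over time, the outer product, and every function and modality name (`sequence_interaction`, `channel_interaction`, `appearance`, `motion`, `audio`) are assumptions chosen only for illustration, and nothing here is learned.

```python
from itertools import combinations


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def mean_pool(seq):
    """Average a sequence of feature vectors over time, per channel."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


def sequence_interaction(a, b):
    """Assumed sequence-level interaction: a T_a x T_b matrix of
    dot-product affinities between every pair of time steps."""
    return [[dot(u, v) for v in b] for u in a]


def channel_interaction(a, b):
    """Assumed channel-level interaction: outer product of the
    time-averaged feature vectors, giving a D x D matrix."""
    pa, pb = mean_pool(a), mean_pool(b)
    return [[x * y for y in pb] for x in pa]


def pairwise_interactions(features):
    """Compute both interaction maps for every unordered modality pair."""
    out = {}
    for m, n in combinations(sorted(features), 2):
        out[(m, n)] = {
            "sequence": sequence_interaction(features[m], features[n]),
            "channel": channel_interaction(features[m], features[n]),
        }
    return out


# Toy features (hypothetical): each modality is a list of 2-D vectors,
# and the modalities may have different sequence lengths.
features = {
    "appearance": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "motion": [[0.5, 0.5], [1.0, 0.0]],
    "audio": [[0.0, 2.0]],
}
interactions = pairwise_interactions(features)
```

With three modalities this yields three pairs. Because each pair gets its own interaction maps, one can inspect which pair and which time steps or channels interact most strongly, which is the kind of explainability the abstract alludes to.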


research
12/07/2018

An Attempt towards Interpretable Audio-Visual Video Captioning

Automatically generating a natural language sentence to describe the con...
research
02/27/2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...
research
08/25/2016

Title Generation for User Generated Videos

A great video title describes the most salient event compactly and captu...
research
06/26/2022

VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

In this paper, we leverage the human perceiving process, that involves v...
research
12/10/2018

Weakly Supervised Dense Event Captioning in Videos

Dense event captioning aims to detect and describe all events of interes...
research
06/25/2018

Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos

This note describes the details of our solution to the dense-captioning ...
research
11/01/2019

Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning

This paper addresses the challenging task of video captioning which aims...
