MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation

08/22/2023
by Najmeh Sadoughi, et al.

Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60 min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on the MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19) and on act segmentation (with an Agreement improvement of +5.51).
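
The abstract does not spell out how the coarse alignment is computed. As a minimal sketch only, assuming that "coarse alignment" means giving tokens from different modalities the same positional index when they fall at the same relative point in the video, the idea could look like the following. The function names, the bucket count, and the sinusoidal form of the encoding are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import List


def alignment_positions(num_tokens: int, num_buckets: int) -> List[int]:
    """Map each token index to a shared temporal bucket by relative position.

    Hypothetical helper: a token at relative time i / num_tokens is assigned
    to bucket floor(i * num_buckets / num_tokens), so sequences of different
    lengths end up indexed on the same coarse time axis.
    """
    if num_tokens <= 0 or num_buckets <= 0:
        raise ValueError("num_tokens and num_buckets must be positive")
    return [(i * num_buckets) // num_tokens for i in range(num_tokens)]


def sinusoidal_encoding(position: int, dim: int) -> List[float]:
    """Standard sinusoidal positional encoding for one position (assumed form)."""
    encoding = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


# Assumed example: 8 video shots and 4 audio segments covering the same clip.
video_positions = alignment_positions(8, 4)
audio_positions = alignment_positions(4, 4)

# Tokens that share a bucket receive the same positional encoding,
# regardless of which modality they come from.
video_encodings = [sinusoidal_encoding(p, 4) for p in video_positions]
audio_encodings = [sinusoidal_encoding(p, 4) for p in audio_positions]
```

In this sketch the first two video shots and the first audio segment all map to bucket 0, so a downstream fusion layer could treat them as temporally co-located even though the two sequences have different lengths.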


