Long-range Multimodal Pretraining for Movie Understanding

08/18/2023
by   Dawit Mureja Argaw, et al.

Learning computer vision models from (and for) movies has a long history. While great progress has been made, there is still a need for a pretrained multimodal model that performs well across the ever-growing set of movie understanding tasks the community has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy and a model that leverage movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over long time spans. After pretraining, we run ablation studies on the LVU benchmark to validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art results on several LVU tasks while being much more data-efficient than previous works. Finally, we evaluate our model's transferability by setting a new state of the art on five different benchmarks.

