Efficient Movie Scene Detection using State-Space Transformers

12/29/2022
by Md Mohaiminul Islam, et al.

The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking multiple S4A blocks. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD), while also being 2× faster and requiring 3× less GPU memory than standard Transformer models. We will release our code and models.
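
The two-stage structure of the S4A block described above can be illustrated with a small pure-Python sketch. This is not the authors' implementation. All function names (`intra_shot_attention`, `state_space_scan`, `s4a_block`, `toy_trans4mer`) are hypothetical, the attention uses identity query/key/value projections, and the state-space step is a toy per-channel linear recurrence standing in for a real S4 layer, with arbitrarily chosen coefficients `a`, `b`, `c`. It only shows the order of operations: attention inside each shot, then a recurrence across all shots, stacked several times.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def intra_shot_attention(shot):
    """Self-attention restricted to the frames of a single shot.

    Assumption: identity Q/K/V projections, so each frame feature vector
    serves as its own query, key, and value.
    """
    d = len(shot[0])
    out = []
    for q in shot:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in shot]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, shot)) for i in range(d)])
    return out


def state_space_scan(frames, a=0.9, b=0.1, c=1.0):
    """Toy stand-in for an S4 layer (assumption, not the real S4).

    Runs x_k = a * x_{k-1} + b * u_k, y_k = c * x_k independently on every
    feature channel, over the whole frame sequence.
    """
    state = [0.0] * len(frames[0])
    out = []
    for u in frames:
        state = [a * s + b * ui for s, ui in zip(state, u)]
        out.append([c * s for s in state])
    return out


def s4a_block(shots):
    """Intra-shot attention first, then an inter-shot state-space pass."""
    attended = [intra_shot_attention(shot) for shot in shots]
    # Flatten so the recurrence can carry information across shot boundaries.
    flat = [frame for shot in attended for frame in shot]
    mixed = state_space_scan(flat)
    # Regroup the frames into their original shots.
    out, start = [], 0
    for shot in shots:
        out.append(mixed[start:start + len(shot)])
        start += len(shot)
    return out


def toy_trans4mer(shots, num_blocks=2):
    """Stack several S4A blocks one after the other."""
    for _ in range(num_blocks):
        shots = s4a_block(shots)
    return shots


# Two shots: the first has two frames, the second has one (2-dim features).
example = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
result = toy_trans4mer(example)
```

In this sketch the attention cost grows with the square of the shot length rather than the square of the full video length, while the recurrence is linear in the number of frames. That is the intuition behind pairing the two operations, although the efficiency figures in the abstract refer to the authors' actual model, not to this toy.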


research
04/04/2022

Long Movie Clip Classification with State-Space Video Models

Most modern video recognition models are designed to operate on short vi...
research
06/27/2022

Long Range Language Modeling via Gated State Spaces

State space models have shown to be effective at modeling long range dep...
research
06/05/2019

Detecting Kissing Scenes in a Database of Hollywood Films

Detecting scene types in a movie can be very useful for application such...
research
04/03/2018

Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling

Recurrent neural networks (RNN), convolutional neural networks (CNN) and...
research
06/04/2023

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

We introduce MoviePuzzle, a novel challenge that targets visual narrativ...
research
10/18/2018

Convolutional Collaborative Filter Network for Video Based Recommendation Systems

This analysis explores the temporal sequencing of objects in a movie tra...
research
03/25/2023

Selective Structured State-Spaces for Long-Form Video Understanding

Effective modeling of complex spatiotemporal dependencies in long-form v...
