Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

07/15/2022
by Kyle Min, et al.

Active speaker detection (ASD) in videos with multiple speakers is a challenging task, as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded as a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve active speaker detection performance, owing to their explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly less memory and computation. Our code is publicly available at https://github.com/SRA2/SPELL
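
The graph construction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `FaceNode` type, the `build_edges` function, and the `max_frame_gap` parameter are hypothetical names introduced here, and the rule that same-person nodes are linked only within a fixed frame gap is an assumption made to keep the graph sparse. Each detected person in each frame becomes one node; nodes are linked when they share a frame (inter-person relationships) or share an identity across nearby frames (temporal dynamics).

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class FaceNode(NamedTuple):
    """One person detected in one frame (hypothetical structure)."""
    frame: int   # frame index
    person: str  # identity / track label


def build_edges(nodes: List[FaceNode], max_frame_gap: int = 1) -> List[Tuple[int, int]]:
    """Return undirected edges (i, j), i < j, over node indices.

    Two nodes are connected if they are in the same frame, or if they
    belong to the same person and are at most `max_frame_gap` frames
    apart. `max_frame_gap` is an assumed parameter for this sketch.
    """
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        a, b = nodes[i], nodes[j]
        same_frame = a.frame == b.frame
        same_person_nearby = (
            a.person == b.person and abs(a.frame - b.frame) <= max_frame_gap
        )
        if same_frame or same_person_nearby:
            edges.add((i, j))
    return sorted(edges)


# Toy video: persons A and B in frames 0 and 1, only A in frame 2.
nodes = [
    FaceNode(0, "A"), FaceNode(0, "B"),
    FaceNode(1, "A"), FaceNode(1, "B"),
    FaceNode(2, "A"),
]
edges = build_edges(nodes)
print(edges)
```

With five nodes, a fully connected graph would have ten edges; the structured graph in this toy case has five. A node classifier (not shown) would then assign each node a speaking or not-speaking label.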


