Audiovisual SlowFast Networks for Video Recognition

01/23/2020
by   Fanyi Xiao, et al.

We present Audiovisual SlowFast Networks (AVSlowFast), an architecture for integrated audiovisual perception. AVSlowFast extends SlowFast Networks with a Faster Audio pathway that is deeply integrated with its visual counterparts. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from the different learning dynamics of the audio and visual modalities, we employ DropPathway, which randomly drops the Audio pathway during training, as a simple and effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization and show that it leads to better audiovisual features. We report state-of-the-art results on four video action classification and detection datasets, perform detailed ablation studies, and show that AVSlowFast generalizes to self-supervised tasks, where it improves over prior work. Code will be made available at: https://github.com/facebookresearch/SlowFast.
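
As a rough illustration of the DropPathway idea described in the abstract, the toy sketch below fuses audio into visual features at several stages and skips the Audio pathway entirely on a random subset of training iterations. It is not the authors' implementation: the function name `fuse_stages`, the `drop_prob` parameter, the use of plain Python lists as features, and element-wise summation as the fusion operator are all assumptions made for this example, and the fused output of one stage is not fed into the next as it would be in a real network.

```python
import random


def fuse_stages(visual_by_stage, audio_by_stage, drop_prob, training, rng=random):
    """Toy multi-stage audiovisual fusion with a DropPathway-style switch.

    visual_by_stage / audio_by_stage: lists of equal-length feature vectors,
    one per fusion stage (hypothetical stand-ins for real feature maps).
    drop_prob: assumed probability of dropping the Audio pathway per iteration.
    """
    # Decide once per iteration, so the Audio pathway is dropped at every
    # fusion stage or at none of them. Never drop at inference time.
    drop_audio = training and rng.random() < drop_prob

    fused = []
    for visual, audio in zip(visual_by_stage, audio_by_stage):
        if drop_audio:
            # Visual features pass through unchanged when audio is dropped.
            fused.append(list(visual))
        else:
            # Assumed fusion operator: element-wise sum.
            fused.append([v + a for v, a in zip(visual, audio)])
    return fused


visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5, 0.5], [1.0, 1.0]]
always_dropped = fuse_stages(visual, audio, drop_prob=1.0, training=True)
never_dropped = fuse_stages(visual, audio, drop_prob=0.0, training=True)
```

With `drop_prob=1.0` the result equals the visual features alone, and with `drop_prob=0.0` (or `training=False`) audio is always fused in; intermediate values trade off between the two.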

research
06/12/2021

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Event classification is inherently sequential and multimodal. Therefore,...
research
06/30/2018

Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization

There is a natural correlation between the visual and auditive elements ...
research
07/03/2023

AVSegFormer: Audio-Visual Segmentation with Transformer

The combination of audio and vision has long been a topic of interest in...
research
12/27/2021

Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception

Thanks to the rapid advances in deep learning techniques and the wide av...
research
08/21/2023

Audio-Visual Class-Incremental Learning

In this paper, we introduce audio-visual class-incremental learning, a c...
research
10/17/2022

An Open-source Benchmark of Deep Learning Models for Audio-visual Apparent and Self-reported Personality Recognition

Personality is crucial for understanding human internal and external sta...
research
08/22/2022

Examining Audio Communication Mechanisms for Supervising Fleets of Agricultural Robots

Agriculture is facing a labor crisis, leading to increased interest in f...
