Audio-Visual Glance Network for Efficient Video Recognition

08/18/2023
by Muhammad Adi Nugroho, et al.

Deep learning has made significant strides in video understanding tasks, but classifying long, large-scale videos with clip-level video classifiers remains prohibitively expensive. To address this issue, we propose the Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of paired image and audio clips and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates a saliency score for each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that outputs the coordinates of the important patches. This approach lets us focus only on the spatio-temporally most important parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of AVGN. By combining these strategies, AVGN sets new state-of-the-art performance on multiple video recognition benchmarks while achieving faster processing speed.
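
The temporal and spatial selection described in the abstract can be illustrated with a toy sketch. This is not the AVGN implementation: the per-frame saliency scores (produced by AV-TeST in the paper) and the patch coordinates (produced by the policy network) are replaced here by placeholder arrays, and the function names `select_salient_frames` and `crop_patch`, the top-k selection rule, and the normalized-offset coordinate convention are all assumptions made for illustration.

```python
import numpy as np


def select_salient_frames(saliency, k):
    """Return indices of the k highest-scoring frames, in temporal order.

    `saliency` stands in for per-frame scores; top-k selection is an
    assumed rule, not necessarily the one used in the paper.
    """
    k = min(k, len(saliency))
    top = np.argsort(saliency)[-k:]  # indices of the k largest scores
    return np.sort(top)              # restore temporal order


def crop_patch(frame, offset_xy, patch_size):
    """Crop a square patch from an (H, W, C) frame.

    `offset_xy` is a hypothetical normalized (x, y) pair in [0, 1] that
    positions the patch's top-left corner within the valid range.
    """
    h, w = frame.shape[:2]
    x0 = int(round(offset_xy[0] * (w - patch_size)))
    y0 = int(round(offset_xy[1] * (h - patch_size)))
    x0 = min(max(x0, 0), w - patch_size)  # clamp to stay inside the frame
    y0 = min(max(y0, 0), h - patch_size)
    return frame[y0:y0 + patch_size, x0:x0 + patch_size]


# Placeholder inputs: 16 frames of 224x224 RGB, random scores and offsets.
rng = np.random.default_rng(0)
video = rng.random((16, 224, 224, 3))
saliency = rng.random(16)       # stand-in for saliency scores
offsets = rng.random((16, 2))   # stand-in for policy-network outputs

keep = select_salient_frames(saliency, k=4)
patches = np.stack([crop_patch(video[t], offsets[t], 96) for t in keep])
# `patches` has shape (4, 96, 96, 3): only these crops would be passed
# to the heavier classifier, rather than all 16 full frames.
```

In this sketch the heavier model would see 4 patches of 96x96 pixels instead of 16 frames of 224x224 pixels, which is the source of the efficiency gain the abstract describes; the specific sizes are arbitrary.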
