Attentional Graph Convolutional Network for Structure-aware Audio-Visual Scene Classification

12/31/2022
by Liguang Zhou, et al.

Audio-visual scene understanding is a challenging problem because of the unstructured spatial-temporal relations in audio signals and the varied spatial layouts of objects and texture patterns in visual images. Many recent studies have focused on extracting features with convolutional neural networks, while the explicit learning of semantically relevant frames of sound signals and regions of visual images has been overlooked. To this end, we present an end-to-end framework, the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the sound spectrogram and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information from the input features, we use an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to better represent the salient regions and contextual information of the audio-visual inputs, four graphs are constructed for the scene representation: the salient acoustic graph (SAG), the contextual acoustic graph (CAG), the salient visual graph (SVG), and the contextual visual graph (CVG). Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experiments on audio, visual, and audio-visual scene recognition datasets show that AGCN achieves promising results. Visualizations of the graphs on spectrograms and images show that the proposed CAG/SAG and CVG/SVG focus on salient and semantically relevant regions.
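The graph stage described in the abstract can be illustrated with a minimal sketch. The abstract does not say how nodes are selected, how edges are defined, or which graph convolution is used, so everything below is an assumption made for illustration: nodes are the top-k feature vectors ranked by a given attention score, edges come from non-negative cosine similarity, and the layer uses the common symmetric normalization D^{-1/2}(A + I)D^{-1/2}. All function names are hypothetical and are not taken from the authors' implementation.

```python
import math

def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def select_salient_nodes(features, scores, k):
    """Keep the k feature vectors with the highest score (assumed salience rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in order]

def cosine_adjacency(nodes):
    """Dense adjacency from non-negative cosine similarity (assumed edge rule).

    The diagonal is left at zero; self-loops are added during normalization.
    """
    norms = [math.sqrt(sum(v * v for v in n)) or 1.0 for n in nodes]
    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                dot = sum(a * b for a, b in zip(nodes[i], nodes[j]))
                adj[i][j] = max(0.0, dot / (norms[i] * norms[j]))
    return adj

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    size = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
             for i in range(size)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]

def gcn_layer(nodes, adj, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(normalize_adjacency(adj), nodes), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy input: four 2-d "region" features with made-up attention scores.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.1]]
scores = [0.9, 0.8, 0.7, 0.1]
nodes = select_salient_nodes(features, scores, k=3)
adjacency = cosine_adjacency(nodes)
identity_weight = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for a learned weight matrix
node_embeddings = gcn_layer(nodes, adjacency, identity_weight)
```

In a full model the node features would come from the attention-fused backbone features, the weight matrix would be learned, and separate graphs would presumably be built for the acoustic and visual branches; the sketch only shows how a set of selected regions can be turned into a graph and propagated through one layer.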


research
05/28/2022

Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification

Environmental sound classification (ESC) is a challenging problem due to...
research
01/06/2019

Enhancing Sound Texture in CNN-Based Acoustic Scene Classification

Acoustic scene classification is the task of identifying the scene from ...
research
10/29/2022

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

There exists an unequivocal distinction between the sound produced by a ...
research
03/13/2020

Adaptive Graph Convolutional Network with Attention Graph Clustering for Co-saliency Detection

Co-saliency detection aims to discover the common and salient foreground...
research
12/09/2022

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Previous studies have explored generating accurately lip-synced talking ...
research
09/16/2019

Acoustic scene analysis with multi-head attention networks

Acoustic Scene Classification (ASC) is a challenging task, as a single s...
research
11/21/2022

LISA: Localized Image Stylization with Audio via Implicit Neural Representation

We present a novel framework, Localized Image Stylization with Audio (LI...
