Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

12/02/2019
by   Wim Boes, et al.
27

We tackle the task of environmental event classification by drawing inspiration from the transformer neural network architecture used in machine translation. We modify this attention-based feedforward structure in such a way that allows the resulting model to use audio as well as video to compute sound event predictions. We perform extensive experiments with these adapted transformers on an audiovisual data set, obtained by appending relevant visual information to an existing large-scale weakly labeled audio collection. The employed multi-label data contains clip-level annotation indicating the presence or absence of 17 classes of environmental sounds, and does not include temporal information. We show that the proposed modified transformers strongly improve upon previously introduced models and in fact achieve state-of-the-art results. We also make a compelling case for devoting more attention to research in multimodal audiovisual classification by proving the usefulness of visual information for the task at hand,namely audio event recognition. In addition, we visualize internal attention patterns of the audiovisual transformers and in doing so demonstrate their potential for performing multimodal synchronization.

READ FULL TEXT

page 2

page 3

page 6

page 7

page 8

research
10/01/2017

Large-scale weakly supervised audio classification using gated convolutional neural network

In this paper, we present a gated convolutional neural network and a tem...
research
10/27/2022

Multimodal Transformer Distillation for Audio-Visual Synchronization

Audio-visual synchronization aims to determine whether the mouth movemen...
research
03/06/2018

Multi-level Attention Model for Weakly Supervised Audio Classification

In this paper, we propose a multi-level attention model to solve the wea...
research
01/10/2019

Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection

The design of new methods and models when only weakly-labeled data are a...
research
05/06/2022

Robustness of Neural Architectures for Audio Event Detection

Traditionally, in Audio Recognition pipeline, noise is suppressed by the...
research
01/23/2023

Zorro: the masked multimodal transformer

Attention-based models are appealing for multimodal processing because i...
research
03/22/2021

Tiny Transformers for Environmental Sound Classification at the Edge

With the growth of the Internet of Things and the rise of Big Data, data...

Please sign up or login with your details

Forgot password? Click here to reset