AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

03/07/2023
by Kang Li, et al.

In this paper, we propose an effective sound event detection (SED) method, termed AST-SED, based on the audio spectrogram transformer (AST) model pretrained on the large-scale AudioSet for the audio tagging (AT) task. Pretrained AST models have recently shown promise in DCASE2022 challenge task 4, where they help mitigate the lack of sufficient real annotated data. However, mainly because of differences between the AT and SED tasks, directly using the outputs of a pretrained AST model is suboptimal. Hence, the proposed AST-SED adopts an encoder-decoder architecture that enables effective and efficient fine-tuning without the need to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self-attention along the frequency axis, which address the issue of multiple overlapping audio events in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU), which compensate for the loss of temporal resolution in the pretrained AST model output. Experimental results on the DCASE2022 task 4 development set demonstrate the superiority of the proposed AST-SED with the FTE-LGD architecture. Specifically, its Event-Based F1-score (EB-F1) of 59.60 and Polyphonic Sound Detection Score scenario 1 (PSDS1) of 0.5140 significantly outperform those of CRNN and other pretrained AST-based systems.
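The FTE and LGD descriptions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The (time, frequency, embedding) token-grid layout, the single untrained attention head, the mean pooling over frequency, and the function names `frequency_wise_self_attention` and `nearest_neighbor_upsample` are all hypothetical assumptions. The sketch shows only the two operations the abstract names: self-attention in which the frequency patches at one time step form the sequence, and nearest-neighbor interpolation that restores temporal resolution. The Bi-GRU stage of the LGD is omitted. In a full model it would run over the upsampled sequence, for example as a PyTorch `torch.nn.GRU` with `bidirectional=True`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frequency_wise_self_attention(tokens, w_q, w_k, w_v):
    """Hypothetical single-head self-attention along the frequency axis.

    tokens: (T, F, D) grid of patch embeddings (assumed layout).
    w_q, w_k, w_v: (D, D) projection matrices (assumed, untrained here).
    Attention is computed independently for every time step t, with the
    F frequency patches at that step acting as the sequence.
    Returns an array of shape (T, F, D).
    """
    q = tokens @ w_q  # (T, F, D)
    k = tokens @ w_k  # (T, F, D)
    v = tokens @ w_v  # (T, F, D)
    d = tokens.shape[-1]
    # scores[t, i, j]: how strongly frequency patch i attends to patch j at time t
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (T, F, F)
    weights = softmax(scores, axis=-1)
    return weights @ v  # (T, F, D)


def nearest_neighbor_upsample(frames, target_len):
    """Upsample a (T, D) frame sequence to (target_len, D) along time.

    Output frame i copies input frame floor(i * T / target_len),
    i.e. plain nearest-neighbor interpolation.
    """
    src_len = frames.shape[0]
    idx = (np.arange(target_len) * src_len) // target_len
    return frames[idx]


# Usage with hypothetical sizes: 4 time steps, 3 frequency patches, 8-dim tokens.
rng = np.random.default_rng(0)
T, F, D = 4, 3, 8
tokens = rng.standard_normal((T, F, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

fte_out = frequency_wise_self_attention(tokens, w_q, w_k, w_v)
frame_emb = fte_out.mean(axis=1)  # (T, D); mean pooling over frequency is an assumption
upsampled = nearest_neighbor_upsample(frame_emb, 4 * T)  # (16, D), ready for a Bi-GRU
print(fte_out.shape, upsampled.shape)  # (4, 3, 8) (16, 8)
```

In this sketch, attention is computed separately for each time step, so events occupying different frequency regions are weighted against one another without mixing information across frames. Temporal context is left to the decoder stage.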
