Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

06/07/2023
by Xian Li, et al.

In recent years, self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. The ultimate goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally comprise clip-level and frame-level tasks. Clip-level tasks classify the scene or sound of an entire audio clip, e.g. audio tagging and instrument recognition. Frame-level tasks detect event-level timestamps within an audio clip, e.g. sound event detection and speaker diarization. Prior studies primarily evaluate on clip-level downstream tasks. Frame-level tasks are important for fine-grained acoustic scene/event understanding, and are generally more challenging than clip-level tasks. To tackle both clip-level and frame-level tasks, this paper proposes two self-supervised audio representation learning methods, ATST-Clip and ATST-Frame, which learn clip-level and frame-level representations, respectively. ATST stands for Audio Teacher-Student Transformer: both methods use a transformer encoder and a teacher-student training scheme. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation.
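To make the distinction between the two task families concrete, the following is a minimal sketch of how the same per-frame event probabilities could be turned into a clip-level decision and into frame-level timestamps. It is an illustration only, not the ATST method or its evaluation code: the function names, the 0.5 threshold, the max-style pooling, and the 40 ms frame hop are all assumptions made for this example.

```python
from typing import List, Tuple


def clip_level_decision(frame_probs: List[float], threshold: float = 0.5) -> bool:
    """Clip-level task: one decision for the whole clip.

    Assumption: the event counts as present if any frame reaches the threshold
    (equivalent to max pooling over time followed by thresholding).
    """
    return any(p >= threshold for p in frame_probs)


def frame_level_events(
    frame_probs: List[float],
    hop_seconds: float = 0.04,  # assumed frame hop, hypothetical value
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Frame-level task: (onset, offset) timestamps in seconds for each active run."""
    events: List[Tuple[float, float]] = []
    onset = None  # index of the first frame of the current active run, if any
    for i, p in enumerate(frame_probs):
        active = p >= threshold
        if active and onset is None:
            onset = i
        elif not active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None
    if onset is not None:
        # The event is still active when the clip ends.
        events.append((onset * hop_seconds, len(frame_probs) * hop_seconds))
    return events


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
    print(clip_level_decision(probs))
    print(frame_level_events(probs, hop_seconds=0.5))
```

The clip-level function discards all timing information, whereas the frame-level function must localize each event in time, which is one way to see why the abstract describes frame-level tasks as the more demanding of the two.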

