Extending Audio Masked Autoencoders Toward Audio Restoration

05/11/2023
by   Zhi Zhong, et al.

Audio classification and restoration are among the major downstream tasks in audio signal processing. However, restoration benefits far less from pretrained models than classification, where pretraining has been overwhelmingly successful. Because of this imbalance, there has been rising interest in improving the performance of pretrained models on restoration tasks such as speech enhancement (SE). Previous works have shown that features extracted by pretrained audio encoders are effective for SE, but these speech-specific, encoder-only models usually require extra decoders to become compatible with SE tasks, and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, this paper extends the audio masked autoencoder (MAE), whose backbone is the autoencoder of Vision Transformers (ViT-AE), from audio classification toward restoration tasks. During pretraining, ViT-AE naturally learns a mel-to-mel mapping that is compatible with restoration tasks. Among the many restoration tasks, SE is chosen for its well-established evaluation metrics and test data. We propose variations of ViT-AE to improve SE performance: the mel-to-mel variations yield high scores on non-intrusive metrics, while the STFT-oriented variation is effective on standard intrusive metrics such as PESQ, so different variations can be chosen to suit the scenario. Comprehensive evaluations and ablation studies show that MAE pretraining is also beneficial to SE tasks and helps ViT-AE generalize better to out-of-domain distortions. We further find that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.
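The masked-autoencoder pretraining the abstract refers to can be illustrated with a minimal NumPy sketch: a mel spectrogram is split into ViT-style patches, a large random subset of patches is masked, and the reconstruction loss is computed only on the masked patches. The patch size, mask ratio, and spectrogram shape below are illustrative assumptions, not values from the paper, and the decoder output is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(mel, patch=16):
    """Split a (freq, time) mel spectrogram into non-overlapping
    patch x patch tiles, flattened to vectors (as in ViT/MAE)."""
    f, t = mel.shape
    f_p, t_p = f // patch, t // patch
    tiles = mel[:f_p * patch, :t_p * patch].reshape(f_p, patch, t_p, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(f_p * t_p, patch * patch)

def random_mask(num_patches, ratio=0.75):
    """MAE-style random masking: keep a (1 - ratio) subset of patch indices."""
    keep = int(num_patches * (1 - ratio))
    perm = rng.permutation(num_patches)
    return perm[:keep], perm[keep:]      # visible, masked

# Toy example: an 80-bin mel spectrogram with 160 frames (assumed shape)
mel = rng.standard_normal((80, 160))
patches = patchify(mel)                  # (50, 256)
visible, masked = random_mask(len(patches))

# The encoder would see only patches[visible]; the decoder predicts
# all patches, and the loss is MSE on the masked ones only.
pred = np.zeros_like(patches)            # stand-in for the decoder output
loss = np.mean((pred[masked] - patches[masked]) ** 2)
```

Because the decoder of this setup already outputs a full mel spectrogram, the same encoder-decoder pair can be fine-tuned to map noisy mel inputs to clean mel targets, which is the mel-to-mel compatibility with restoration that the abstract describes.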


Related research

- ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation (11/23/2022)
- Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss (05/24/2023)
- Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network (09/01/2017)
- Depa: Self-supervised audio embedding for depression detection (10/29/2019)
- What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation (07/12/2023)
- Vision Transformers are Parameter-Efficient Audio-Visual Learners (12/15/2022)
- Task-aware Warping Factors in Mask-based Speech Enhancement (08/27/2021)
