Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

10/22/2020
by Saurabh Kataria, et al.

Deep learning based speech denoising still struggles to improve the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses. A perceptual loss discourages distortion to certain speech properties, and we analyze it using six large-scale pre-trained models: acoustic event classification, an acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+ and wav2vec 2.0). We first build a strong baseline (without PERL) using Conformer Transformer Networks on the popular enhancement benchmark VCTK-DEMAND. Applying the auxiliary models one at a time, we find the acoustic event model and the self-supervised encoder PASE+ to be the most effective. Our best model (PERL-AE) uses only the acoustic event model (trained on AudioSet) and outperforms state-of-the-art methods on the major perceptual metrics. To explore whether denoising can leverage the full framework, we use all six networks, but find that the resulting seven-loss formulation suffers from the known challenges of Multi-Task Learning. Finally, we report a critical observation: state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of the losses.
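The abstract gives no implementation detail, so the following PyTorch sketch is only an illustration of the underlying idea, not the paper's actual recipe: all class and function names, the choice of L1 feature distance, and the single-layer feature comparison are assumptions. It shows a perceptual loss that compares the activations of a frozen pre-trained audio model on enhanced and clean speech, and a PERL-style objective that adds a hand-tuned weighted sum of several such losses to an ordinary reconstruction term.

```python
import torch
import torch.nn as nn


class PerceptualLoss(nn.Module):
    """Feature-matching loss against a frozen pre-trained model.

    `feature_extractor` is any pre-trained audio network (e.g. an
    acoustic event classifier); its weights stay frozen and only its
    activations on enhanced vs. clean speech are compared.
    Hypothetical sketch, not the paper's implementation.
    """

    def __init__(self, feature_extractor: nn.Module):
        super().__init__()
        self.extractor = feature_extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad = False  # auxiliary model is never updated

    def forward(self, enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
        # L1 distance between the frozen model's activations
        return torch.mean(torch.abs(self.extractor(enhanced) - self.extractor(clean)))


def perl_loss(enhanced, clean, perceptual_losses, weights, recon_weight=1.0):
    """PERL-style objective: a reconstruction term plus a hand-tuned
    weighted sum of per-model perceptual losses (hypothetical sketch)."""
    loss = recon_weight * torch.mean(torch.abs(enhanced - clean))
    for w, pl in zip(weights, perceptual_losses):
        loss = loss + w * pl(enhanced, clean)
    return loss
```

The closing observation contrasts hand-tuned weights with learned ones. One representative weight-learning method (named here only as an example of the class of methods the abstract refers to, not necessarily one the paper evaluates) is homoscedastic-uncertainty weighting (Kendall et al., 2018), where each loss i is scaled by exp(-s_i) and regularized by s_i, with s_i = log(sigma_i^2) a learned parameter:

```python
class UncertaintyWeighting(nn.Module):
    """Learned multi-task loss weighting via homoscedastic uncertainty
    (Kendall et al., 2018). Illustrative sketch only."""

    def __init__(self, num_losses: int):
        super().__init__()
        # one log-variance parameter per task loss
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            # exp(-s) down-weights noisy tasks; + s penalizes large variance
            total = total + torch.exp(-s) * loss + s
        return total
```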
