Multi-task self-supervised learning for Robust Speech Recognition

01/25/2020
by Mirco Ravanelli, et al.

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, that are tasked to solve self-supervised problems (i.e., problems that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-prints and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics through an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE and common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.
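For a concrete picture of the pipeline the abstract describes, the following is a minimal, hypothetical PyTorch sketch: an online distortion step that contaminates the waveform, a convolutional-plus-recurrent encoder, and several self-supervised workers trained jointly. It is not the authors' implementation; the names (distort, Encoder, Worker), layer sizes, and regression targets are illustrative assumptions.

```python
import torch
import torch.nn as nn


def distort(wave: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Online contamination: add white noise at a given SNR. Other
    disturbances (e.g. reverberation, clipping, filtering) would be
    applied the same way, on the fly, during training."""
    noise = torch.randn_like(wave)
    scale = torch.sqrt(wave.pow(2).mean() /
                       (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return wave + scale * noise


class Encoder(nn.Module):
    """Convolutional front-end (short-term dynamics) followed by a
    recurrent layer (long-term dynamics)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, wave):                       # wave: (batch, samples)
        z = self.conv(wave.unsqueeze(1))           # (batch, dim, frames)
        z, _ = self.rnn(z.transpose(1, 2))         # (batch, frames, dim)
        return z


class Worker(nn.Module):
    """Small regressor predicting one self-supervised target per frame."""
    def __init__(self, dim: int, target_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, target_dim))

    def forward(self, z):
        return self.net(z)


# One toy training step: distort the waveform, encode it, and sum the
# worker losses against placeholder clean-signal targets.
encoder = Encoder()
workers = nn.ModuleList([Worker(256, 40), Worker(256, 13)])  # e.g. FBANK, MFCC
opt = torch.optim.Adam(list(encoder.parameters()) + list(workers.parameters()))

clean = torch.randn(4, 16000)                      # 1 s of 16 kHz audio (dummy)
z = encoder(distort(clean))                        # (batch, frames, 256)
# Real targets would be features computed from the clean signal and
# aligned to the encoder frames; random tensors stand in for them here.
targets = [torch.randn(z.size(0), z.size(1), 40),
           torch.randn(z.size(0), z.size(1), 13)]
loss = sum(nn.functional.mse_loss(w(z), t) for w, t in zip(workers, targets))
loss.backward()
opt.step()
```

The sketch only conveys the overall structure: in the paper, the workers cover a richer set of self-supervised tasks and the distortion module draws from several disturbance types (e.g., additive noise and reverberation) at random.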
