WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

10/26/2021
by Sanyuan Chen, et al.

Self-supervised learning (SSL) achieves great success in speech recognition, but only limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, in which additional overlapped utterances are created in an unsupervised manner and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pretrained models are available at https://aka.ms/wavlm.
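The gated relative position bias mentioned in the abstract can be pictured as a learned, bucketed relative-position term added to the attention logits, with its contribution scaled by a gate computed from the query content. The sketch below is a simplified, hypothetical illustration of that idea only; the exact gating equations and bucketing in the paper differ, and the names here (GatedRelativePositionBias, num_buckets) are made up for illustration.

```python
# Simplified sketch of a content-gated relative position bias.
# Not the paper's exact formulation; names and bucketing are illustrative.
import torch
import torch.nn as nn

class GatedRelativePositionBias(nn.Module):
    def __init__(self, head_dim: int, num_buckets: int = 320):
        super().__init__()
        self.num_buckets = num_buckets
        self.bias = nn.Embedding(num_buckets, 1)  # one learned bias per distance bucket
        self.gate = nn.Linear(head_dim, 1)        # query-dependent gate

    def _bucket(self, rel_pos: torch.Tensor) -> torch.Tensor:
        # crude bucketing: clamp relative offsets into [0, num_buckets)
        half = self.num_buckets // 2
        return rel_pos.clamp(-half, half - 1) + half

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, heads, T, head_dim); returns a bias of shape (batch, heads, T, T)
        T = q.size(2)
        pos = torch.arange(T, device=q.device)
        rel = pos[None, :] - pos[:, None]             # (T, T) relative offsets
        b = self.bias(self._bucket(rel)).squeeze(-1)  # (T, T) learned bias per offset
        g = torch.sigmoid(self.gate(q))               # (batch, heads, T, 1) content gate
        return g * b                                  # broadcasts to (batch, heads, T, T)

# The returned bias would be added to q @ k.transpose(-2, -1) / sqrt(d) before the softmax.
```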

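For the utterance mixing strategy, the abstract describes creating overlapped utterances without supervision during training. Below is a minimal sketch of that idea, assuming raw-waveform batches and made-up parameter names (mix_prob, max_mix_ratio): a random chunk of another utterance in the batch is overlaid on part of the primary waveform at reduced energy, while the training targets still follow the primary utterance only. It illustrates the concept rather than the released training code.

```python
# Hypothetical sketch of unsupervised utterance mixing for overlapped speech.
import torch

def mix_utterances(batch: torch.Tensor,
                   mix_prob: float = 0.2,
                   max_mix_ratio: float = 0.5) -> torch.Tensor:
    """batch: (B, T) raw waveforms. Returns waveforms with simulated overlap.

    Targets are not touched: they remain those of the primary utterance.
    """
    mixed = batch.clone()
    B, T = batch.shape
    for i in range(B):
        if torch.rand(1).item() > mix_prob:
            continue
        # pick a different utterance from the same batch as the interfering speaker
        j = torch.randint(0, B, (1,)).item()
        if j == i:
            continue
        # cap the overlapped region so the primary speaker stays dominant
        mix_len = torch.randint(1, int(T * max_mix_ratio) + 1, (1,)).item()
        start = torch.randint(0, T - mix_len + 1, (1,)).item()
        src_start = torch.randint(0, T - mix_len + 1, (1,)).item()
        # scale the interfering chunk below the primary utterance's level
        energy_ratio = torch.rand(1).item()
        chunk = batch[j, src_start:src_start + mix_len]
        mixed[i, start:start + mix_len] += energy_ratio * chunk
    return mixed
```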
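For trying out the released checkpoints, here is a minimal usage sketch, assuming the Hugging Face transformers port of WavLM and the "microsoft/wavlm-large" model id; the official code and checkpoints are linked from https://aka.ms/wavlm.

```python
# Minimal sketch: extract frame-level WavLM representations from a waveform.
# The "microsoft/wavlm-large" id is an assumption; see https://aka.ms/wavlm
# for the official releases.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large")

waveform = torch.randn(16000)  # stand-in for 1 second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_frames, hidden_size)
```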
Related research

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training (10/12/2021)
Self-supervised learning (SSL) is a long-standing goal for speech proces...

Self-Supervised Learning for speech recognition with Intermediate layer supervision (12/16/2021)
Recently, pioneer work finds that speech pre-trained models can solve fu...

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition (05/25/2023)
Self-supervised learning (SSL) based speech pre-training has attracted m...

Speech separation with large-scale self-supervised learning (11/09/2022)
Self-supervised learning (SSL) methods such as WavLM have shown promisin...

Intra-class variation reduction of speaker representation in disentanglement framework (08/04/2020)
In this paper, we propose an effective training strategy to extract rob...

Feature Learning and Ensemble Pre-Tasks Based Self-Supervised Speech Denoising and Dereverberation (06/10/2022)
Self-supervised learning (SSL) achieves great success in monaural speech...

An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification (08/22/2023)
Wav2vec2 has achieved success in applying Transformer architecture and s...
