Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

10/23/2019 ∙ by Xingchen Song, et al.

Self-attention networks (SAN) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme "Speech-XLNet" for unsupervised acoustic model pretraining to learn speech representations with a SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer to encourage the SAN to make inferences by focusing on global structures through its attention weights. In addition, Speech-XLNet also allows the model to explore bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves the SAN/HMM performance in terms of both convergence speed and recognition accuracy compared to the one trained from randomly initialized weights. Our best systems achieve a relative improvement of 11.9% and 8.3% on the TIMIT and WSJ tasks respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on TIMIT, which, to our knowledge, is the lowest PER obtained from a single system.







1 Introduction

Recently, unsupervised representation learning paradigms such as BERT [3] and XLNet [23] have been highly successful for language modeling with the self-attention network (SAN), such as the transformer [22]. More specifically, a SAN can benefit from bi-directional context representation learning through unsupervised pretraining with large-scale unlabeled data before finetuning with labeled data. For automatic speech recognition (ASR), SANs have been introduced for acoustic modeling either in an attention framework (also known as the speech transformer [4]) or with a CTC loss [6, 18].

Unsupervised speech representation learning has also been investigated by learning to predict either raw audio samples [19] or features [10, 2]. The pretrained networks are then used as feature extractors for various downstream tasks, such as speaker and phone recognition [2], emotion recognition [10] and speech recognition [19]. Recently, representation learning with a SAN was employed in [10] for speech emotion recognition, where the SAN was pretrained in an autoregressive (AR) manner to predict the next frame (Future Observation Prediction (FOP)). The main purpose of FOP is to exploit the local smoothness of the speech signal. While this objective is important for emotion recognition, FOP may not work well for sequence mapping tasks like speech recognition. Firstly, it is well known that neighboring speech frames are highly correlated; exploiting such local smoothness might already be sufficient to predict the next frame [2]. Therefore, without regularization, the pretraining may have difficulty capturing longer dependencies. Secondly, AR pretraining also suffers from the lack of bi-directional context information, which is quite important for sequence classification as well.

Given the success of XLNet for language modeling, in this paper, we present “Speech-XLNet”, an XLNet-like acoustic model pretraining scheme for SANs. Instead of using a fixed forward order as in [10], Speech-XLNet maximizes the expected log likelihood of a speech feature sequence with respect to all possible permutations of the factorization order. We conjecture that by shuffling the frame orders, the permutation serves as a strong regularizer to encourage the network to explore longer span structures through its attention weights. In addition, the permutation also allows a frame to utilize contextual information from other positions in a particular permutation order to capture bi-directional context. After pretraining, the SAN is finetuned with labeled data to predict senone targets under a SAN/HMM framework rather than to extract features.

Experimental results on TIMIT and WSJ benchmark tasks clearly demonstrate that with Speech-XLNet pretrained weights, the subsequent finetuning is much more stable and converges much faster than training from randomly initialized weights. More importantly, the finetuned SAN consistently outperforms its counterpart trained from randomly initialized weights. To the best of our knowledge, our best system achieves the lowest phone error rate of 13.3% on TIMIT test set among all published results.

2 System overview

The system overview of the proposed Speech-XLNet is shown in Fig.1. The SAN consists of a stack of self-attention blocks as in Fig.1-(b). Each block has two sub-modules: a multi-head attention layer [22] and a position-wise feed-forward layer. As in Fig.1-(c), dropout [21], residual connections [7] and layer normalization [1] are applied after both the self-attention and feed-forward layers. The permutation-based AR pretraining of the SAN is given in Fig.1-(a). Different from most previous speech representation learning approaches, we finetune the pretrained SAN directly on labelled data for speech recognition.


Figure 1: Speech-XLNet system overview. (a) XLNet-like pretraining. (b) Finetuning. (c) Self-attention block.

3 Speech-XLNet

XLNet is a generalized AR pretraining method proposed to overcome the limitations of BERT, namely pretrain-finetune discrepancy and independence assumptions, while retaining its advantage of bi-directional context learning. In this section, we present “Speech-XLNet”, a permutation-based AR acoustic model pretraining scheme adapted from XLNet.

3.1 Pretrain objective function

Instead of the density estimation in XLNet, Speech-XLNet aims to “predict” the next acoustic frame, i.e. a regression task. Specifically, let Z_T be the set of all possible permutations of a length-T frame sequence, and let z_t and z_{<t} denote the t-th element and the first t−1 elements of a permutation z ∈ Z_T. The permutation acoustic modeling objective can be expressed as:

    min_θ  E_{z∼Z_T} [ Σ_{t=1}^{T} ℒ(x_{z_t}, x̂_{z_t}) ]

where θ denotes the parameter set and x̂_{z_t} is the predicted frame given the previous frames x_{z_{<t}} of the permutation order z. For optimization, we adopted the smooth Mean Absolute Error (MAE) loss (Huber loss) as in [16] as our pretraining loss, which is a combination of the L1 and L2 losses as given below:

    ℒ(x, x̂) = 0.5 (x − x̂)²          if |x − x̂| < δ
    ℒ(x, x̂) = δ |x − x̂| − 0.5 δ²    otherwise

where δ is a scalar to balance the L1 and L2 losses. As θ is shared across all orders in Z_T, each frame can see all other frames via permutations, which encourages the model to learn bi-directional contexts and longer span structures.
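For concreteness, the Huber pretraining loss can be sketched in PyTorch (the framework used for the implementation); the tensor shapes are illustrative assumptions, and torch.nn.HuberLoss computes the same quantity:

```python
import torch

def huber_loss(pred, target, delta=1.0):
    """Smooth MAE (Huber) loss: quadratic (L2) for small errors,
    linear (L1) beyond the delta threshold."""
    err = (pred - target).abs()
    quad = 0.5 * err.pow(2)
    lin = delta * err - 0.5 * delta ** 2
    return torch.where(err < delta, quad, lin).mean()

# Illustrative shapes: (batch, frames, 40-dim log-Mel filterbanks)
pred = torch.randn(4, 10, 40)
target = torch.randn(4, 10, 40)
loss = huber_loss(pred, target, delta=1.0)
```

The same loss is available as torch.nn.HuberLoss(delta=1.0) in PyTorch, which can be used directly instead of the hand-rolled version above.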

3.2 Permutation via attention masks

Same as XLNet, the permutation is achieved via attention masks while keeping the original sequence order. Fig.2-(a) illustrates a permutation order of [3, 2, 4, 1] via masking. To reduce the optimization difficulty, we also choose to only predict a certain percentage of frames from the tail portion of the permuted sequence. Formally, we introduce a hyperparameter p = K/T to determine the percentage of frames selected for prediction, where T is the sequence length and K is the number of selected frames. The objective function thus becomes:

    min_θ  E_{z∼Z_T} [ Σ_{t=T−K+1}^{T} ℒ(x_{z_t}, x̂_{z_t}) ]
For more robust pretraining, instead of performing the permutation once during data preprocessing as in the original XLNet, we adopted a dynamic permutation strategy where we generate a new permutation order every time a speech sequence is fed to the trainer. By using a different permutation for each sequence in every epoch, we effectively increase the amount of training data, which is helpful when the pretraining data is limited.
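As a sketch of how the dynamic permutation, attention masking and tail prediction fit together, the following PyTorch function builds visibility masks for one randomly drawn factorization order. The function and variable names are our own, and the convention mask[i, j] = True meaning "frame j is visible when predicting frame i" is an illustrative assumption:

```python
import torch

def permutation_masks(seq_len, p=0.2, generator=None):
    """Visibility masks for one randomly drawn factorization order."""
    z = torch.randperm(seq_len, generator=generator)  # dynamic permutation
    # rank[i] = position of original frame i within the permuted order z
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[z] = torch.arange(seq_len)
    # Content stream: a frame sees frames at or before itself in the order.
    content_mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    # Query stream: strictly-earlier frames only (no access to own content).
    query_mask = rank.unsqueeze(1) > rank.unsqueeze(0)
    # Tail prediction: only the last K = round(p * T) frames of the
    # permuted order are kept as prediction targets.
    k = max(1, int(round(p * seq_len)))
    predict = rank >= seq_len - k
    return content_mask, query_mask, predict
```

Calling this once per sequence, every epoch, realizes the dynamic permutation strategy; with p = 0.2, roughly the last fifth of the permuted order is selected for prediction.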

Figure 2: (a) Permutation order [3, 2, 4, 1] via attention mask for content (upper) and query streams (lower). Shaded blue grid indicates that the content of the frames corresponding to the columns can be used to predict the frame corresponding to the row. (b) Two-stream attention.

3.3 Two-stream attention

The permutation in XLNet introduces ambiguity in target prediction, since the standard hidden representation does not carry the information of which position it will predict. This is addressed by target-aware representations via a two-stream attention mechanism [23], namely the content stream and the query stream, as shown in Fig.2-(b). The two streams of representations are schematically updated with a shared set of parameters θ as follows [23]:

    g_{z_t}^{(m)} = Attention(Q = g_{z_t}^{(m−1)}, KV = h_{z_{<t}}^{(m−1)}; θ)   (query stream)
    h_{z_t}^{(m)} = Attention(Q = h_{z_t}^{(m−1)}, KV = h_{z_{≤t}}^{(m−1)}; θ)   (content stream)

where z_t is the frame position in the original feature sequence corresponding to the t-th element of a permutation sequence, and g_{z_t}^{(m)} is the query stream of layer m, which uses the position z_t but has no access to the content x_{z_t}. On the other hand, the content stream h_{z_t}^{(m)} can see both the content and the position as in standard self-attention. The two attention streams are again realized via the masks in Fig.2-(a). Note that the query stream and the permutation are only used for pretraining and are discarded during finetuning.
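A minimal single-head, single-layer sketch of the two-stream update is given below, assuming a visibility-mask convention of mask[i, j] = True meaning frame j is visible to frame i. The real model uses multi-head attention with residual connections and layer normalization, which are omitted here for brevity; batch dimensions are also dropped:

```python
import math
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    """Sketch of one two-stream attention layer with shared weights."""
    def __init__(self, d_model=64):
        super().__init__()
        # One shared set of projections (theta) serves both streams.
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    def _attend(self, q, h, visible):
        scores = self.wq(q) @ self.wk(h).transpose(-2, -1) / self.scale
        scores = scores.masked_fill(~visible, float('-inf'))
        probs = torch.softmax(scores, dim=-1)
        # The first frame in the order sees nothing in the query stream;
        # map its NaN softmax row to zeros instead of propagating it.
        probs = torch.nan_to_num(probs, nan=0.0)
        return probs @ self.wv(h)

    def forward(self, h, g, content_mask, query_mask):
        h_new = self._attend(h, h, content_mask)  # sees z_<=t, incl. own content
        g_new = self._attend(g, h, query_mask)    # sees z_<t only
        return h_new, g_new
```

Both streams attend over the content representations h; only the masks differ, which is exactly how the permutation is realized without reordering the input.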

4 Experiments and Analyses

4.1 Pretraining experimental setups

We conducted the pretraining using a pool of the Librispeech [12], TED-LIUM release 2 [17] and WSJ-si284 corpora. The features are 40-dimensional log-Mel filterbanks extracted using Kaldi [14] with global cepstral mean and variance normalization. No delta or frame stacking is used. Our SAN consists of a stack of six self-attention blocks. Each block has eight attention heads and the model dimension is 512. The dimension of the feed-forward layer is 2048. Dropout is set to 0.1. The tail prediction percentage p is set to 0.2. Our SAN is implemented with PyTorch [13]. Weights are initialized randomly from the Xavier normal distribution [5]. The network is optimized with Adam [9] with a weight decay of 0.01. The Huber loss δ is set to 1.0. The network was pretrained for a total of 50 epochs, with five warm-up epochs and a linear learning rate decay [3]. The pretraining was conducted using four Tesla M40 GPUs with a batch size of 6000 frames, and the model parameters were updated with a gradient accumulation of 10 batches. The pretrained SAN was finetuned with the cross-entropy loss under the hybrid SAN/HMM setup.
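The encoder configuration and learning rate schedule above can be sketched with stock PyTorch modules. The input projection, the regression head, and the peak learning rate value are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# Configuration from the text: 6 self-attention blocks, 8 heads,
# model dimension 512, feed-forward dimension 2048, dropout 0.1.
d_model, n_heads, n_layers, d_ff, n_mels = 512, 8, 6, 2048, 40

block = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
    dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=n_layers)
in_proj = nn.Linear(n_mels, d_model)   # 40-dim log-Mel frames -> model dim
out_proj = nn.Linear(d_model, n_mels)  # regression head for frame prediction

params = (list(in_proj.parameters()) + list(encoder.parameters())
          + list(out_proj.parameters()))
# The peak learning rate below is a placeholder assumption.
optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=0.01)

def lr_scale(epoch, warmup=5, total=50):
    """Linear warm-up over the first 5 epochs, then linear decay to 0."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    return max(0.0, (total - epoch) / (total - warmup))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

feats = torch.randn(2, 100, n_mels)    # (batch, frames, mel bins)
pred = out_proj(encoder(in_proj(feats)))
```

Stepping the scheduler once per epoch reproduces the warm-up-then-decay profile; the prediction head outputs one 40-dimensional filterbank vector per input frame, matching the regression objective of Section 3.1.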

4.2 Phone recognition on TIMIT

For TIMIT phone recognition, we followed the same recipe as PyTorch-Kaldi [15]. The frame-level alignments were obtained using the Kaldi s5 recipe with a triphone GMM/HMM with 1936 senone clusters. For decoding, a bigram phone language model trained on the training-data transcriptions was used. The baseline SAN was trained from randomly initialized weights with a batch size of 4000 frames. A total of 40 epochs were conducted, with four warm-up epochs. The same hyper-parameters were used for finetuning the pretrained SAN.

Figure 3: PERs(%) on TIMIT dev-set.

The PER trend on the dev set is depicted in Fig.3. We can clearly observe that the pretrained SAN converges much faster than the one with randomly initialized weights. More importantly, the pretrained SAN also consistently outperforms the baseline system.

(a) Randomly initialized weights, start of training. (b) Speech-XLNet pretrained weights, start of finetuning. (c) Randomly initialized weights, end of training. (d) Speech-XLNet pretrained weights, end of finetuning. (The remaining four heads show the same trend and are omitted.)
Figure 4: Plots of attention scores of the last self-attention block for a TIMIT sentence “FEDW0-SX364”.

We also present some PER comparisons with different hyper-parameter setups in Table 1. It is well known that SANs are very sensitive to hyper-parameters. Therefore, huge efforts have to be devoted to architecture engineering and hyper-parameter tuning. This can also be seen in Table 1 as different learning rates affect the PERs of the randomly initialized SANs significantly. On the contrary, for pretrained SANs, the performance is much more stable. This may indicate that with permutation-based pretraining, the attention weights are more focused on learning discriminative information, making the further classification tasks much easier.

Learning rate   Randomly Init.          Speech-XLNet Pretrained
                dev      test           dev      test
3e-5            16.7     18.4           11.8     13.3 (27.7%)
1e-4            14.7     16.8           11.8     13.5 (19.6%)
2e-4            14.5     16.1           11.7     13.4 (16.8%)
1e-3            13.2     15.1           12.2     13.5 (10.6%)
2e-3            diverge  diverge        13.3     14.8 (–)

Table 1: PER(%) comparison of different learning rates. Numbers inside parentheses are relative improvements over randomly initialized weights with the same training hyper-parameter setup.

To validate our conjectures, we visualized the attention scores of the last self-attention block under different configurations in Fig.4. Comparing Fig.4-(a) and Fig.4-(b), Speech-XLNet clearly learns some prior knowledge of how to represent the data, which helps the subsequent finetuning process converge faster. Furthermore, at the end of training with randomly initialized weights, the attention scores of all heads in Fig.4-(c) manifest a clear diagonal pattern, which means only limited context around the current frame is explored. On the other hand, it is interesting to note that in Fig.4-(d), the attention scores of the second head are all off-diagonal by a large margin, with fairly spread-out probability distributions. This may indicate that the attention is distributed to learn some “global” structures useful for the following classification task.

Architecture        Features            Pretrain        PER(%)
DNN-Sigmoid [11]    fMLLR               RBM-DBN         16.8
DNN-Sigmoid [11]    fMLLR               Random          16.5
LSTM [11]           fMLLR               Random          15.0
Li-GRU [15]         mfcc/fbank/fMLLR    Random          13.8
SAN (this work)     fbank               Random          15.1
SAN (this work)     fbank               Speech-XLNet    13.3

Table 2: PER comparison with previous approaches

Lastly, we give a PER comparison with other approaches in Table 2. Note that all the systems in the table use the same Kaldi recipe to generate the training alignments and decoding graphs. The most famous unsupervised pretraining scheme for ASR is the deep belief network (DBN), proposed at the early adoption of DNNs. A DBN is a generative model built with stacks of restricted Boltzmann machines (RBMs). However, by comparing the first and second rows in Table 2, similar to [8], we observe that with proper hyper-parameter settings, performance similar to RBM-DBN pretraining can be obtained from purely discriminative training with randomly initialized weights. Although the proposed Speech-XLNet follows the same pretrain-finetune training strategy, the downstream classification task clearly benefits from the permutation-based representation learning, as seen from the last two rows. To the best of our knowledge, our PER of 13.3% is the lowest among all published results on the TIMIT test set.

4.3 Word recognition on WSJ

We further evaluate Speech-XLNet on WSJ-si284 with the same pretrained SAN as for TIMIT. For WSJ-si284 finetuning, a batch size of 15000 frames is used and the targets consist of a total of 3392 senone clusters. For decoding, a pruned 4-gram language model (fgpr in the Kaldi recipe) is trained from the supplied language model training texts. The WER performance for different learning rates is shown in Table 3. Similar to the TIMIT task, finetuning with Speech-XLNet consistently outperforms the network trained from randomly initialized weights. The best system achieves a WER of 4.4% with a learning rate of 1e-4, translating to an 8.3% relative WER reduction compared to the best baseline WER of 4.8%.

Learning rate   Randomly Init.   Speech-XLNet Pretrained
2e-5            5.2              4.5 (13.5%)
3e-5            4.8              4.5 (6.3%)
6e-5            5.0              4.5 (10.0%)
1e-4            4.9              4.4 (10.2%)

Table 3: WER(%) on WSJ eval92. Numbers inside parentheses are relative improvements over randomly initialized weights with the same training hyper-parameter setup.

5 Conclusions

In this work, we present Speech-XLNet for unsupervised speech representation learning with self-attention networks (SANs). The effectiveness of the pretraining was evaluated using a hybrid SAN/HMM system with senone targets on two benchmark tasks, namely TIMIT and WSJ. Compared to the SAN trained from randomly initialized weights, finetuning from Speech-XLNet is much more stable and converges much faster. More importantly, the finetuned SAN consistently outperforms the one trained from randomly initialized weights. Our best systems achieve a relative improvement of 11.9% and 8.3% on TIMIT and WSJ respectively. Specifically, our best system achieved a PER of 13.3% on the TIMIT core test set, which is the best published performance to our knowledge. In the future, we will apply the proposed Speech-XLNet to end-to-end speech transformers on larger training sets. In addition, we believe Speech-XLNet is also very appealing for code-switching ASR [20], by pretraining on a large pool of monolingual data from all languages considered and finetuning with code-switching data.


  • [1] J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. ArXiv abs/1607.06450. Cited by: §2.
  • [2] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Proc. Interspeech 2019, pp. 146–150. Cited by: §1.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §4.1.
  • [4] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. ICASSP 2018, pp. 5884–5888. Cited by: §1.
  • [5] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Cited by: §4.1.
  • [6] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, Cited by: §1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. CVPR 2016, pp. 770–778. Cited by: §2.
  • [8] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. W. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, pp. 82–97. Cited by: §4.2.
  • [9] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.1.
  • [10] Z. Lian, J. Tao, B. Liu, and J. Huang (2019) Unsupervised representation learning with future observation prediction for speech emotion recognition. In Proc. Interspeech 2019, pp. 3840–3844. Cited by: §1.
  • [11] J. Michalek and J. Vanek (2018) A survey of recent dnn architectures on the timit phone recognition task. ArXiv abs/1806.07974. Cited by: Table 2.
  • [12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. ICASSP 2015, pp. 5206–5210. Cited by: §4.1.
  • [13] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.
  • [14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. K. Goel, M. Hannemann, P. Motlícek, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý (2011) The kaldi speech recognition toolkit. Cited by: §4.1.
  • [15] M. Ravanelli, T. Parcollet, and Y. Bengio (2018) The pytorch-kaldi speech recognition toolkit. ICASSP 2019, pp. 6465–6469. Cited by: §4.2, Table 2.
  • [16] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, pp. 1137–1149. Cited by: §3.1.
  • [17] A. Rousseau, P. Deléglise, and Y. Estève (2012) TED-lium: an automatic speech recognition dedicated corpus. In LREC, Cited by: §4.1.
  • [18] J. Salazar, K. Kirchhoff, and Z. Huang (2019) Self-attention networks for connectionist temporal classification in speech recognition. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7115–7119. Cited by: §1.
  • [19] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pp. 3465–3469. Cited by: §1.
  • [20] C. Shan, C. Weng, G. Wang, D. Su, M. X. Luo, D. Yu, and L. Xie (2019) Investigating end-to-end speech recognition for mandarin-english code-switching. ICASSP 2019, pp. 6056–6060. Cited by: §5.
  • [21] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958. Cited by: §2.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §1, §2.
  • [23] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. ArXiv abs/1906.08237. Cited by: §1, §3.3.