
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

by Xingchen Song, et al.

Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining to learn speech representations with SANs. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame order, the permutation in Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structure through its attention weights. In addition, Speech-XLNet allows the model to exploit bi-directional context for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves SAN/HMM performance in both convergence speed and recognition accuracy compared to training from randomly initialized weights. Our best systems achieve relative improvements of 11.9% and 8.3% on the TIMIT and WSJ tasks respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to our best knowledge, is the lowest PER obtained from a single system.
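To make the permutation idea concrete, the following is a minimal sketch (not the authors' implementation) of how an XLNet-style factorization order over speech frames can be turned into an attention mask: a random permutation is sampled, and each frame may attend only to frames that precede it in that permutation, so every frame is predicted from a different subset of left and right context in the original time order. The function name `permutation_mask` and the NumPy formulation are illustrative assumptions.

```python
import numpy as np

def permutation_mask(num_frames: int, rng: np.random.Generator):
    """Build an attention mask for one random factorization order.

    Under XLNet-style permutation modeling, frame j may attend to
    frame i only if i precedes j in the sampled permutation, so each
    frame sees a different (bidirectional in original time order)
    subset of context frames.  Illustrative sketch only.
    """
    order = rng.permutation(num_frames)       # random factorization order
    rank = np.empty(num_frames, dtype=int)
    rank[order] = np.arange(num_frames)       # rank[t] = position of frame t in the order
    # mask[j, i] is True when frame j is allowed to attend to frame i
    mask = rank[None, :] < rank[:, None]
    return order, mask

rng = np.random.default_rng(0)
order, mask = permutation_mask(5, rng)
# The frame drawn first in the order has no visible context;
# the frame drawn last can attend to all other frames.
assert mask[order[0]].sum() == 0
assert mask[order[-1]].sum() == 4
```

Sampling a fresh permutation per utterance (or per training step) is what gives the regularizing effect conjectured above: the model cannot rely on a fixed left-to-right context and must attend to global structure.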

