
Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

10/23/2019
by Xingchen Song, et al.

Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining that learns speech representations with a SAN. The pretrained SAN is fine-tuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the order of speech frames, the permutation in Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structure through its attention weights. In addition, Speech-XLNet allows the model to exploit bi-directional context for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves SAN/HMM performance in both convergence speed and recognition accuracy compared to training from randomly initialized weights. Our best systems achieve relative improvements of 11.9% and 8.3% on the TIMIT and WSJ tasks respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to our knowledge, is the lowest PER obtained from a single system.
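To make the idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of permutation-based pretraining over speech frames: a factorization order is sampled, an attention mask restricts each position to frames earlier in that order, and the network regresses frame features. The helper name `permutation_attention_mask`, the toy encoder, and the L1 objective are illustrative assumptions, not the authors' implementation; the actual Speech-XLNet additionally uses XLNet's two-stream attention so the prediction stream never sees the target frame, which is omitted here for brevity.

```python
# Illustrative sketch only -- not the authors' code. The encoder, the mask
# helper, and the L1 reconstruction objective are assumptions for exposition.
import torch
import torch.nn as nn

def permutation_attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """(T, T) boolean mask where True blocks attention: position i may
    attend to frame j only if j comes no later than i in the order `perm`."""
    T = perm.numel()
    rank = torch.empty(T, dtype=torch.long)
    rank[perm] = torch.arange(T)        # rank[t] = position of frame t in the order
    # Content-stream-style mask: each frame sees itself plus earlier-ranked
    # frames. XLNet's query stream would also hide the target frame itself.
    return rank.unsqueeze(0) > rank.unsqueeze(1)

T, D = 100, 40                          # frames per utterance, feature dim
frames = torch.randn(1, T, D)           # toy filterbank-like features

layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D, D)                  # regress the raw frame features

perm = torch.randperm(T)                # sample one factorization order
mask = permutation_attention_mask(perm)

hidden = encoder(frames, mask=mask)
loss = nn.functional.l1_loss(head(hidden), frames)
loss.backward()                         # gradients for one pretraining step
print(f"toy permutation-pretraining loss: {loss.item():.3f}")
```

Because a fresh permutation is sampled at every step, the attention mask keeps changing, which is one way to read the paper's claim that the permutation acts as a regularizer forcing the SAN to rely on global structure rather than a fixed local context.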

Related research

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing (03/18/2022)
Recently, speech representation learning has improved many speech-relate...

Unsupervised Cross-lingual Representation Learning for Speech Recognition (06/24/2020)
This paper presents XLSR which learns cross-lingual speech representatio...

Injecting Text in Self-Supervised Speech Pretraining (08/27/2021)
Self-supervised pretraining for Automated Speech Recognition (ASR) has s...

Cloze-driven Pretraining of Self-attention Networks (03/19/2019)
We present a new approach for pretraining a bi-directional transformer m...

Revisiting End-to-End Speech-to-Text Translation From Scratch (06/09/2022)
End-to-end (E2E) speech-to-text translation (ST) often depends on pretra...

Reinventing 2D Convolutions for 3D Medical Images (11/24/2019)
There has been considerable debate over 2D and 3D representation learnin...

Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems (04/11/2022)
Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) ...