Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

09/14/2021
by Felix Wu, et al.

This paper studies performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0 and formalize several architecture designs that influence both model performance and efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.

research
04/13/2022

HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition

Pre-training with self-supervised models, such as Hidden-unit BERT (HuBE...
research
10/13/2022

HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning

While the Turkish language is listed among low-resource languages, liter...
research
04/25/2022

On-demand compute reduction with stochastic wav2vec 2.0

Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture ...
research
12/07/2022

Improved Speech Pre-Training with Supervision-Enhanced Acoustic Unit

Speech pre-training has shown great success in learning useful and gener...
research
11/23/2021

Effect of noise suppression losses on speech distortion and ASR performance

Deep learning based speech enhancement has made rapid development toward...
research
03/08/2020

Development of Automatic Speech Recognition for Kazakh Language using Transfer Learning

Development of Automatic Speech Recognition system for Kazakh language i...
research
10/27/2022

TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and Punctuation model evaluation and selection

Punctuation and Segmentation are key to readability in Automatic Speech ...
