End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining

09/08/2023
by Saksham Bassi, et al.

The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilizes pretrained representations for lexical tokens. In light of recent successes in large-scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end models and find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models, and, further, that the choice of pretraining objective substantially affects a model's ability to be adapted to the disfluency removal task.
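To make the end-to-end setup concrete, below is a minimal, hypothetical PyTorch sketch: a pretrained acoustic encoder is topped with a CTC head and trained directly on transcripts from which disfluencies have already been deleted, so transcription and cleaning happen in a single pass. The class name, the toy stand-in encoder, and all dimensions are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class EndToEndDisfluencyRemover(nn.Module):
    """Hypothetical sketch: pretrained acoustic encoder + CTC head."""
    def __init__(self, encoder, hidden_dim, vocab_size):
        super().__init__()
        self.encoder = encoder  # in practice, a large self-supervised model
        # +1 output class for the CTC blank symbol
        self.ctc_head = nn.Linear(hidden_dim, vocab_size + 1)

    def forward(self, features):
        hidden = self.encoder(features)               # (batch, time, hidden_dim)
        return self.ctc_head(hidden).log_softmax(-1)  # (batch, time, vocab + 1)

# Toy stand-in encoder so the sketch runs; a real system would use a
# pretrained model such as wav2vec 2.0 or HuBERT here.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
model = EndToEndDisfluencyRemover(encoder, hidden_dim=256, vocab_size=32)

# Targets are *fluent* transcripts: disfluent tokens are removed from the
# labels before training, which is what makes the removal end-to-end.
feats = torch.randn(4, 100, 80)           # (batch, frames, mel bins)
targets = torch.randint(0, 32, (4, 20))   # cleaned token ids
log_probs = model(feats).transpose(0, 1)  # CTC expects (time, batch, classes)

ctc = nn.CTCLoss(blank=32)
loss = ctc(log_probs,
           targets,
           torch.full((4,), 100, dtype=torch.long),  # input lengths
           torch.full((4,), 20, dtype=torch.long))   # target lengths
loss.backward()

A two-stage baseline, by contrast, would first decode the full disfluent transcript with a standard ASR model and then pass the text through a separately trained lexical cleaning model, relying on pretrained text representations at the second stage.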


Related research

10/09/2021
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
Self-supervised pretraining on speech data has achieved a lot of progres...

08/27/2021
Injecting Text in Self-Supervised Speech Pretraining
Self-supervised pretraining for Automated Speech Recognition (ASR) has s...

02/08/2022
CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
Deriving multimodal representations of audio and lexical inputs is a cen...

05/30/2022
E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation
Sequence-to-sequence (seq2seq) learning has become a popular trend for p...

10/29/2019
Depa: Self-supervised audio embedding for depression detection
Depression detection research has increased over the last few decades as...

09/21/2023
Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition
Current speaker recognition systems primarily rely on supervised approac...

04/01/2022
WavFT: Acoustic model finetuning with labelled and unlabelled data
Unsupervised and self-supervised learning methods have leveraged unlabel...
