JOIST: A Joint Speech and Text Streaming Model For ASR

10/13/2022
by   Tara N. Sainath, et al.

We present JOIST, an algorithm to train a streaming, cascaded encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data; both are novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.
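To make the two ideas in the abstract concrete (modeling the length of a text sequence, and training jointly on paired and text-only inputs), the following is a minimal sketch and not the authors' implementation. It assumes that text length is modeled by repeating each sub-word token so the sequence is closer in length to a sequence of speech frames, and that the joint objective is a weighted sum of a paired loss and a text-only loss. The function names, the repetition ranges, and the `text_weight` parameter are all hypothetical.

```python
import random


def upsample_fixed(tokens, repeat):
    """Repeat every token `repeat` times (assumed fixed-duration text model)."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    return [tok for tok in tokens for _ in range(repeat)]


def upsample_random(tokens, min_repeat, max_repeat, rng):
    """Repeat each token a random number of times in [min_repeat, max_repeat].

    This is one hypothetical way to mimic variable token durations in speech.
    """
    if not 1 <= min_repeat <= max_repeat:
        raise ValueError("need 1 <= min_repeat <= max_repeat")
    out = []
    for tok in tokens:
        out.extend([tok] * rng.randint(min_repeat, max_repeat))
    return out


def joint_loss(paired_loss, text_loss, text_weight):
    """Combine the two losses in one step (joint training, no separate
    pre-training and fine-tuning stages). `text_weight` is an assumed
    hyperparameter, not a value taken from the paper."""
    return paired_loss + text_weight * text_loss


if __name__ == "__main__":
    rng = random.Random(0)
    subwords = ["jo", "int", "train", "ing"]
    print(upsample_fixed(subwords, 2))
    print(len(upsample_random(subwords, 1, 3, rng)))
    print(joint_loss(paired_loss=1.0, text_loss=2.0, text_weight=0.25))
```

In this sketch the upsampled token sequence would stand in for acoustic frames on the text-only branch, while paired utterances use real audio. How the actual model embeds, masks, or weights the text input is not stated in the abstract.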


