Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

10/31/2022
by Suyoun Kim, et al.

Recently, there has been increasing interest in two-pass streaming end-to-end automatic speech recognition (ASR), which adds a 2nd-pass rescoring model on top of the conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring models, Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-pass model and chooses the best output by re-scoring the n-best initial outputs. However, training Transformer Rescorer requires expensive paired audio-text data because the model uses audio embeddings as input. In this work, we present our Joint Audio/Text training method for Transformer Rescorer, which leverages unpaired text-only data that is cheaper to obtain than paired audio-text data. We evaluate Transformer Rescorer with our Joint Audio/Text training on the LibriSpeech dataset as well as on our large-scale in-house dataset, and show that our training method can significantly reduce word error rate (WER) compared to the standard Transformer Rescorer without adding any model parameters or latency.
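To make the rescoring step concrete, here is a minimal sketch of generic n-best rescoring in Python. It is not the paper's implementation: the `Hypothesis` type, the `rescore_nbest` function, the linear interpolation, and the `weight` parameter are all assumptions made for illustration. The 2nd-pass scorer here sees only text, whereas Transformer Rescorer as described above also consumes audio embeddings from the 1st-pass model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One entry of the n-best list (hypothetical structure)."""
    text: str
    first_pass_score: float  # log-probability assigned by the 1st-pass model


def rescore_nbest(
    nbest: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> Hypothesis:
    """Return the hypothesis with the best interpolated score.

    `second_pass_score` stands in for the 2nd-pass model and is assumed
    to return a log-probability for a hypothesis text. The linear
    interpolation with `weight` is an illustrative choice, not a detail
    taken from the paper.
    """
    if not nbest:
        raise ValueError("n-best list must not be empty")

    def combined(hyp: Hypothesis) -> float:
        return (1.0 - weight) * hyp.first_pass_score + weight * second_pass_score(hyp.text)

    return max(nbest, key=combined)


# Toy usage: a lookup table plays the role of the 2nd-pass model.
toy_scores = {"a cat sat": -3.0, "the cat sat": -0.5}
nbest = [Hypothesis("a cat sat", -1.0), Hypothesis("the cat sat", -1.2)]
best = rescore_nbest(nbest, lambda text: toy_scores[text])
print(best.text)
```

With `weight=0.0` the function falls back to the 1st-pass ranking, which is a simple way to check how much the 2nd pass changes the selected output.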


