End-to-End Speech Recognition Contextualization with Large Language Models

09/19/2023
by Egor Lakomkin, et al.

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, and train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improves by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that has been trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
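The mixed-modal, decoder-only setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_mixed_modal_input`, the ordering (context, then audio, then transcription), and the choice to compute the loss only on transcription positions are assumptions made for illustration. Embeddings are plain lists of floats so the example runs without any dependencies.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def build_mixed_modal_input(
    audio: List[Vector],
    transcript: List[Vector],
    context: Optional[List[Vector]] = None,
) -> Tuple[List[Vector], List[bool]]:
    """Assemble one decoder-only training sequence (hypothetical helper).

    Assumed layout: [optional text context | audio features | transcription].
    Returns the concatenated sequence and a mask that is True only on the
    transcription positions, i.e. where a training loss would be applied.
    """
    context = context or []  # context is optional, as in the abstract
    sequence = context + audio + transcript

    # All positions must live in the same embedding space as the LLM input.
    dims = {len(vec) for vec in sequence}
    if len(dims) > 1:
        raise ValueError(f"mismatched embedding sizes: {sorted(dims)}")

    prefix_len = len(context) + len(audio)
    loss_mask = [False] * prefix_len + [True] * len(transcript)
    return sequence, loss_mask


# Toy example with 2-dimensional embeddings.
ctx = [[0.1, 0.2]] * 3      # 3 context text tokens
aud = [[0.5, 0.5]] * 5      # 5 audio feature frames
txt = [[0.9, 0.8]] * 4      # 4 transcription tokens

seq_with_ctx, mask_with_ctx = build_mixed_modal_input(aud, txt, ctx)
seq_no_ctx, mask_no_ctx = build_mixed_modal_input(aud, txt)
```

Because the context segment is simply omitted when absent, the same routine covers both the contextualized and the plain recognition case, which mirrors the "optional text tokens" wording above.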


