Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

09/19/2023
by   Yosuke Higuchi, et al.

We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks in a zero-shot setting when provided with a precise instruction or prompt to guide the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance. Specifically, we direct an LLM to correct grammatical errors in an ASR hypothesis and harness the embedded linguistic knowledge to conduct end-to-end ASR. The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, where an instruction-tuned LLM (i.e., Llama2) is employed as a front-end for the decoder. An ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and is then fed into the LLM along with an instruction. The decoder subsequently takes the LLM embeddings as input to perform sequence generation, incorporating acoustic information from the encoder output. Experimental results and analyses demonstrate that the proposed integration yields promising performance improvements, and our approach largely benefits from LLM-based rescoring.
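The pipeline described in the abstract begins by decoding an intermediate ASR hypothesis from the encoder via CTC, then wrapping that hypothesis in an error-correction instruction for the LLM. The sketch below illustrates these first two steps under stated assumptions: it implements greedy (best-path) CTC decoding, and the instruction wording, token inventory, and function names are illustrative inventions, not the paper's actual prompt or code.

```python
BLANK = 0  # conventional CTC blank index

def ctc_greedy_decode(logits, id2token):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    hyp = []
    prev = None
    for idx in path:
        if idx != prev and idx != BLANK:  # skip repeats and blank symbols
            hyp.append(id2token[idx])
        prev = idx
    return "".join(hyp)

def build_instruction(hypothesis):
    """Illustrative prompt directing the LLM to fix grammatical errors (assumed wording)."""
    return f"Correct the grammatical errors in the following ASR hypothesis: {hypothesis}"

# Toy frame-level scores over the vocabulary {<blank>, a, c, t}.
id2token = {0: "<blank>", 1: "a", 2: "c", 3: "t"}
logits = [
    [0.1, 0.2, 0.6, 0.1],  # argmax: c
    [0.1, 0.1, 0.7, 0.1],  # argmax: c (repeat, collapsed)
    [0.8, 0.1, 0.0, 0.1],  # argmax: <blank> (dropped)
    [0.1, 0.6, 0.2, 0.1],  # argmax: a
    [0.0, 0.1, 0.1, 0.8],  # argmax: t
    [0.1, 0.0, 0.1, 0.8],  # argmax: t (repeat, collapsed)
]
hyp = ctc_greedy_decode(logits, id2token)  # → "cat"
prompt = build_instruction(hyp)
```

In the full model, the LLM's output embeddings for this prompt (rather than the prompt text itself) would be passed to the attention decoder together with the encoder's acoustic representations; that fusion step is omitted here since it requires the trained networks.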

Related research

07/13/2021 · Zero-shot Speech Translation
Speech Translation (ST) is the task of translating speech in one languag...

10/22/2020 · How Phonotactics Affect Multilingual and Zero-shot ASR Performance
The idea of combining multiple languages' recordings to train a single a...

11/25/2019 · Independent language modeling architecture for end-to-end ASR
The attention-based end-to-end (E2E) automatic speech recognition (ASR) ...

04/01/2022 · Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition
Aphasia is a common speech and language disorder, typically caused by a ...

02/12/2022 · USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder
Improving end-to-end speech recognition by incorporating external text d...

10/25/2022 · Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition
The recent emergence of joint CTC-Attention model shows significant impr...

10/08/2021 · Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units
In end-to-end automatic speech recognition (ASR), a model is expected to...
