Prompting Large Language Models with Speech Recognition Abilities

07/21/2023
by Yassir Fathullah, et al.

Large language models (LLMs) have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted into an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18%, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, as well as the effects of scaling up the audio encoder and of increasing the audio encoder stride to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
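The prepending step and the effect of the encoder stride can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy embedding dimension, and the clip length are hypothetical, and the sketch assumes the audio encoder output has already been projected to the same dimension as the LLM's text token embeddings.

```python
from typing import List

Vector = List[float]


def num_audio_embeddings(duration_ms: int, stride_ms: int) -> int:
    """Count the embeddings emitted for a clip, assuming one per full stride.

    Hypothetical helper: integer milliseconds avoid floating-point rounding.
    """
    if stride_ms <= 0:
        raise ValueError("stride_ms must be positive")
    return duration_ms // stride_ms


def prepend_audio(audio_emb: List[Vector], text_emb: List[Vector]) -> List[Vector]:
    """Place the audio embeddings in front of the text token embeddings.

    Assumes both sequences already share one embedding dimension.
    """
    dims = {len(v) for v in audio_emb + text_emb}
    if len(dims) > 1:
        raise ValueError("audio and text embeddings must share one dimension")
    return audio_emb + text_emb


# Toy example: a 10-second clip, a 1-second stride, 3 text tokens, dimension 4.
dim = 4
audio = [[0.0] * dim for _ in range(num_audio_embeddings(10_000, 1_000))]
text = [[1.0] * dim for _ in range(3)]
sequence = prepend_audio(audio, text)
print(len(sequence))  # 10 audio embeddings + 3 text embeddings = 13
```

With these toy numbers, a stride of 80 ms would produce 125 audio embeddings for the same clip, while a stride of 1 second produces only 10, which is why larger strides make long-form audio easier to fit into the LLM's context.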

