AudioPaLM: A Large Language Model That Can Speak and Listen

06/22/2023
by Paul K. Rubenstein, et al.

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits from AudioLM the capability to preserve paralinguistic information such as speaker identity and intonation, and from text large language models such as PaLM-2 the linguistic knowledge present only in such models. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

research
05/24/2023

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Joint speech-language training is challenging due to the large demand fo...
research
06/05/2023

BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion

Spoken languages often utilise intonation, rhythm, intensity, and struct...
research
08/01/2023

Advancing Beyond Identification: Multi-bit Watermark for Language Models

This study aims to proactively tackle misuse of large language models be...
research
11/01/2021

PerSpeechNorm: A Persian Toolkit for Speech Processing Normalization

In general, speech processing models consist of a language model along w...
research
09/20/2023

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

This paper explores the potential of constructing an AI spoken dialogue ...
research
05/17/2023

Using a Large Language Model to Control Speaking Style for Expressive TTS

Appropriate prosody is critical for successful spoken communication. Con...
research
06/08/2023

Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Large Language Models (LLMs) have been applied in the speech domain, oft...
