Pengi: An Audio Language Model for Audio Tasks

05/19/2023
by Soham Deshmukh, et al.

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input and generates free-form text as output. An audio encoder represents the input audio as a sequence of continuous embeddings, and a text encoder does the same for the corresponding text input. The two sequences are combined into a prefix that prompts a pre-trained, frozen language model. The unified architecture of Pengi supports both open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance on several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
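The prefix construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two encoders below are toy stand-ins that emit deterministic pseudo-random vectors, and every name here (EMBED_DIM, toy_audio_encoder, toy_text_encoder, build_prefix), the embedding width, the number of audio tokens, and the audio-then-text ordering are assumptions made for illustration. It only shows the data flow: two sequences of continuous embeddings are concatenated into a single prefix that a frozen language model would consume.

```python
import random

# Hypothetical width of the language model's input embeddings (assumption).
EMBED_DIM = 8


def toy_audio_encoder(waveform, num_tokens=4):
    """Toy stand-in for an audio encoder.

    Maps a waveform (list of floats) to a fixed-length sequence of
    continuous embeddings. A real encoder would be a trained network.
    """
    rng = random.Random(len(waveform))  # deterministic for the demo
    return [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for _ in range(num_tokens)]


def toy_text_encoder(prompt):
    """Toy stand-in for a text encoder: one embedding per whitespace token."""
    embeddings = []
    for token in prompt.split():
        rng = random.Random(token)  # same token always gives the same vector
        embeddings.append([rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)])
    return embeddings


def build_prefix(audio_embeddings, text_embeddings):
    """Concatenate the two sequences along the sequence axis.

    The audio-first ordering is an assumption of this sketch.
    """
    return audio_embeddings + text_embeddings


# Usage: a one-second silent clip at 16 kHz and a task prompt.
audio_emb = toy_audio_encoder([0.0] * 16000)
text_emb = toy_text_encoder("generate audio caption")
prefix = build_prefix(audio_emb, text_emb)

# The prefix holds 4 audio positions followed by 3 text positions.
print(len(prefix), len(prefix[0]))
```

In a full system, the frozen language model would receive this prefix as its input embeddings and generate the output text autoregressively; only the components that produce the prefix would need training.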

