AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

04/25/2023
by Rongjie Huang, et al.

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite this recent success, current LLMs cannot process complex audio information or conduct spoken conversations (as Siri or Alexa do). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) an input/output interface (ASR, TTS) to support spoken dialogue. Given the growing demand to evaluate how well multi-modal LLMs understand human intention and cooperate with foundation models, we outline evaluation principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking-head understanding and generation in multi-round dialogues, empowering humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at <https://github.com/AIGC-Audio/AudioGPT>.
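The abstract describes a pipeline in which a spoken query passes through ASR, an LLM-style controller assigns the task to an audio foundation model, and the answer is returned (via TTS for spoken replies). The minimal sketch below illustrates that flow only in spirit; every function name, the keyword-based dispatcher, and the model labels are hypothetical stand-ins, not AudioGPT's actual API.

```python
# Illustrative sketch of an AudioGPT-style dialogue round:
# ASR (modality transformation) -> task analysis -> model assignment -> reply.
# All names below are assumptions made for this example.

def asr(audio: bytes) -> str:
    """Stand-in speech-to-text: pretend the audio bytes decode to a query."""
    return audio.decode("utf-8")  # a real system would run an ASR model here

def dispatch(query: str) -> str:
    """Toy task analysis: route the query to a named foundation model."""
    routes = {
        "sing": "singing-voice-synthesis-model",
        "transcribe": "speech-recognition-model",
        "separate": "source-separation-model",
    }
    for keyword, model in routes.items():
        if keyword in query.lower():
            return model
    return "text-to-speech-model"  # default: answer as synthesized speech

def audiogpt_round(audio: bytes) -> tuple[str, str]:
    """One dialogue round: returns the chosen model and a textual reply."""
    query = asr(audio)
    model = dispatch(query)
    reply = f"[{model}] handled: {query}"
    return model, reply
```

In the real system the dispatcher is the LLM itself, which parses the user's intent from the conversation history rather than matching keywords; the sketch only shows where that decision sits in the loop.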


