LLaSM: Large Language and Speech Model

08/30/2023
by   Yu Shu, et al.
0

Multi-modal large language models have garnered significant interest recently. Though, most of the works focus on vision-language multi-modal models providing strong capabilities in following vision-and-language instructions. However, we claim that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM demonstrates a more convenient and natural way for humans to interact with artificial intelligence. Specifically, we also release a large Speech Instruction Following dataset LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.

READ FULL TEXT

page 3

page 4

page 6

research
05/18/2023

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Multi-modal large language models are regarded as a crucial step towards...
research
12/08/2022

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Generalist models, which are capable of performing diverse multi-modal t...
research
04/28/2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

How to efficiently transform large language models (LLMs) into instructi...
research
04/25/2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Large language models (LLMs) have exhibited remarkable capabilities acro...
research
05/18/2023

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Foundation models have made significant strides in various applications,...
research
09/03/2021

A Longitudinal Multi-modal Dataset for Dementia Monitoring and Diagnosis

Dementia is a family of neurogenerative conditions affecting memory and ...
research
09/01/2023

Point-Bind Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

We introduce Point-Bind, a 3D multi-modality model aligning point clouds...

Please sign up or login with your details

Forgot password? Click here to reset