SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

08/31/2023
by Xin Zhang, et al.

Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we establish SLMTokBench, the first benchmark for this purpose. Our results indicate that neither semantic nor acoustic tokens are ideal for this role. We therefore propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). By unifying semantic and acoustic tokens, it disentangles different aspects of speech information hierarchically across the RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) on top of SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. USLM also outperforms VALL-E on zero-shot text-to-speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
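Since the abstract centers on residual vector quantization, a compact sketch of an RVQ encode/decode pass may help readers unfamiliar with the mechanism. This is a generic illustration, not the authors' implementation: the dimensions, codebook size, layer count, and all names below (`dim`, `codebook_size`, `num_quantizers`, `codebooks`) are illustrative assumptions, and real codebooks are learned rather than random.

```python
# Minimal sketch of residual vector quantization (RVQ).
# Each layer quantizes the residual left over by the previous layers,
# so early layers capture coarse structure and later layers refine it --
# the hierarchy that SpeechTokenizer exploits to separate semantic
# content from finer acoustic detail across RVQ layers.
import numpy as np

rng = np.random.default_rng(0)

dim = 8              # latent dimension of the encoder output (assumed)
codebook_size = 16   # entries per codebook (assumed)
num_quantizers = 4   # number of RVQ layers (assumed)

# One codebook per RVQ layer; learned in practice, random here.
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))

def rvq_encode(x):
    """Quantize vector x into one token index per RVQ layer."""
    residual = x
    tokens = []
    for cb in codebooks:
        # Pick the codebook entry nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens):
    """Reconstruct by summing the chosen entry of each layer."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

x = rng.normal(size=dim)
tokens = rvq_encode(x)
x_hat = rvq_decode(tokens)
print("tokens:", tokens)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

The sketch only shows the plain RVQ mechanics; the disentanglement the abstract describes comes from how SpeechTokenizer trains the layers, with the first RVQ layer steered toward semantic content and the remaining layers left to model the residual acoustic information.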

