RepCodec: A Speech Representation Codec for Speech Tokenization

08/31/2023
by   Zhichao Huang, et al.
0

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/31/2023

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Current speech large language models build upon discrete speech represen...
research
05/24/2023

LMs with a Voice: Spoken Language Modeling beyond Speech Tokens

We present SPECTRON, a novel approach to adapting pre-trained language m...
research
09/19/2023

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Discrete audio representation, aka audio tokenization, has seen renewed ...
research
03/29/2022

Autoregressive Co-Training for Learning Discrete Speech Representations

While several self-supervised approaches for learning discrete speech re...
research
09/07/2022

AudioLM: a Language Modeling Approach to Audio Generation

We introduce AudioLM, a framework for high-quality audio generation with...
research
09/08/2023

Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

Visually grounded speech systems learn from paired images and their spok...

Please sign up or login with your details

Forgot password? Click here to reset