Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

09/14/2023
by   Yifan Yang, et al.

The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into using discrete tokens for tasks such as speech recognition and translation. Discrete tokens offer lower storage requirements and great potential for applying natural language processing techniques. However, these studies, which mainly focus on a single task, have faced challenges such as overfitting and performance degradation in speech recognition, often at the cost of performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models for speech recognition and synthesis. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on FBank features in speech recognition and outperform mel-spectrogram features in speech synthesis on both subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction.

research
09/14/2023

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

We propose a decoder-only language model, VoxtLM, that can perform four ...
research
09/19/2023

Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

Discrete audio representation, aka audio tokenization, has seen renewed ...
research
08/21/2023

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

We present TokenSplit, a speech separation model that acts on discrete t...
research
05/29/2023

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

Self-supervised learning (SSL) of speech has shown impressive results in...
research
05/21/2023

Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

We present a large-scale in-the-wild Japanese laughter corpus and a laug...
research
10/27/2022

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Recent years have witnessed great strides in self-supervised learning (S...
research
05/31/2023

Perception and Semantic Aware Regularization for Sequential Confidence Calibration

Deep sequence recognition (DSR) models receive increasing attention due ...
