TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

08/21/2023
by   Hakan Erdogan, et al.
0

We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We also present a "refinement" version of the model that predicts enhanced audio tokens from the audio tokens of speech separated by a conventional separation model. Using both objective metrics and subjective MUSHRA listening tests, we show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.

READ FULL TEXT
research
05/18/2020

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

We present Mask CTC, a novel non-autoregressive end-to-end automatic spe...
research
09/14/2023

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

We propose a decoder-only language model, VoxtLM, that can perform four ...
research
09/04/2020

What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

In incremental text to speech synthesis (iTTS), the synthesizer produces...
research
09/14/2023

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Self-supervised learning (SSL) proficiency in speech-related tasks has d...
research
04/20/2020

WHALETRANS: E2E WHisper to nAturaL spEech conversion using modified TRANSformer network

In this article, we investigate whispered-to natural-speech conversion m...
research
02/12/2021

Multimodal Punctuation Prediction with Contextual Dropout

Automatic speech recognition (ASR) is widely used in consumer electronic...
research
04/13/2023

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

This paper introduces a novel Token-and-Duration Transducer (TDT) archit...

Please sign up or login with your details

Forgot password? Click here to reset