VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

09/14/2023
by Soumi Maiti, et al.

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM shows a significant improvement in speech synthesis, with speech intelligibility improving from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over its single-task counterparts. VoxtLM is trained on publicly available data, and the training recipes and model checkpoints will be open-sourced to make the work fully reproducible.
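As an illustration of how a shared vocabulary and special tokens could let one decoder-only model cover all four tasks, here is a minimal sketch. It is not the authors' implementation: the special-token names, the `<sp_N>` naming of discrete speech units, the task labels, and the helper functions `build_vocab`, `format_example`, and `encode` are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical special-token names; the abstract only states that
# special tokens are used to enable multitask learning.
SPECIAL = ["<start-text>", "<start-speech>", "<generate-text>",
           "<generate-speech>", "<eos>"]


def build_vocab(text_tokens: List[str], num_speech_units: int) -> Dict[str, int]:
    """Merge special, text, and discrete speech tokens into one id space."""
    speech_tokens = [f"<sp_{i}>" for i in range(num_speech_units)]
    vocab: Dict[str, int] = {}
    for tok in SPECIAL + text_tokens + speech_tokens:
        if tok in vocab:
            raise ValueError(f"duplicate token: {tok}")
        vocab[tok] = len(vocab)
    return vocab


def format_example(task: str,
                   text: Optional[List[str]] = None,
                   speech: Optional[List[int]] = None) -> List[str]:
    """Lay out one training sequence; the special tokens mark the task."""
    txt = list(text or [])
    sp = [f"<sp_{u}>" for u in (speech or [])]
    if task == "asr":        # speech recognition: speech condition, text target
        return ["<start-speech>", *sp, "<generate-text>", *txt, "<eos>"]
    if task == "tts":        # speech synthesis: text condition, speech target
        return ["<start-text>", *txt, "<generate-speech>", *sp, "<eos>"]
    if task == "textlm":     # text generation / continuation
        return ["<start-text>", *txt, "<eos>"]
    if task == "speechlm":   # speech continuation
        return ["<start-speech>", *sp, "<eos>"]
    raise ValueError(f"unknown task: {task}")


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map a token sequence to integer ids in the shared vocabulary."""
    return [vocab[t] for t in tokens]


if __name__ == "__main__":
    vocab = build_vocab(["he", "llo"], num_speech_units=3)
    seq = format_example("asr", text=["he", "llo"], speech=[2, 0])
    print(seq)
    print(encode(seq, vocab))
```

Because every task is reduced to a single token sequence over one vocabulary, the same next-token training objective can be applied to all four tasks; only the special tokens tell the model which modality it is conditioning on and which it should generate.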
