Generative Spoken Language Modeling from Raw Audio

02/01/2021
by Kushal Lakhotia, et al.

Generative spoken language modeling involves jointly learning the acoustic and linguistic characteristics of a language from raw audio only (without text or labels). We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks: speech resynthesis (repeating the speech input using the system's own voice) and speech generation (producing novel speech outputs, either conditioned on a spoken prompt or unconditionally). We validate these metrics against human judgment. We test baseline systems consisting of a discrete speech encoder (returning discrete, low-bitrate, pseudo-text units), a generative language model (trained on the pseudo-text units), and a speech decoder (generating a waveform from pseudo-text). By comparing three state-of-the-art unsupervised speech encoders (Contrastive Predictive Coding (CPC), wav2vec 2.0, and HuBERT) and varying the number of discrete units (50, 100, 200), we investigate how generative performance depends on the quality of the learned units, as measured by unsupervised metrics (zero-shot probe tasks). We will open-source our evaluation stack and baseline models.
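To make the encoder / language model split concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: the nearest-centroid quantizer, the collapsing of repeated units, the add-one-smoothed bigram model, and all function names are assumptions chosen for illustration, and the speech decoder is left out entirely. Real systems would use learned encoder features (CPC, wav2vec 2.0, or HuBERT) and a much stronger language model.

```python
import random
from collections import defaultdict


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: a nearest-centroid quantizer stands in for the discrete
    speech encoder; the centroids are given rather than learned here.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: dist2(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit.

    Assumption: this is one simple way to turn frame-level units into a
    shorter pseudo-text sequence.
    """
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def train_bigram(sequences, n_units):
    """Count unit bigrams with add-one smoothing (a toy generative LM)."""
    counts = defaultdict(lambda: [1] * n_units)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, n_new, n_units, seed=0):
    """Continue a non-empty unit prompt by sampling from the bigram model."""
    rng = random.Random(seed)
    units = list(prompt)
    for _ in range(n_new):
        weights = counts[units[-1]]
        units.append(rng.choices(range(n_units), weights=weights)[0])
    return units


# Toy 1-D "features" and two hypothetical centroids.
frames = [[0.0], [0.1], [0.9], [1.0], [0.05]]
centroids = [[0.0], [1.0]]

pseudo_text = collapse_repeats(quantize(frames, centroids))
lm = train_bigram([pseudo_text], n_units=len(centroids))
continuation = generate(lm, prompt=pseudo_text[:1], n_new=5, n_units=len(centroids))
```

In this sketch, conditional generation corresponds to passing a prompt derived from encoded speech; a decoder would then map the resulting unit sequence back to a waveform.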

research
03/11/2022

Are discrete units necessary for Spoken Language Modeling?

Recent work in spoken language modeling shows the possibility of learnin...
research
11/23/2020

The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

We introduce a new unsupervised task, spoken language modeling: the lear...
research
04/29/2021

The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling

We present the Zero Resource Speech Challenge 2021, which asks participa...
research
10/31/2021

Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units

Language models (LMs) for text data have been studied extensively for th...
research
09/07/2021

Text-Free Prosody-Aware Generative Spoken Language Modeling

Speech pre-training has primarily demonstrated efficacy on classificatio...
research
03/30/2022

Generative Spoken Dialogue Language Modeling

We introduce dGSLM, the first "textless" model able to generate audio sa...
research
10/21/2022

A Textless Metric for Speech-to-Speech Comparison

This paper proposes a textless speech-to-speech comparison metric that a...
