Textually Pretrained Speech Language Models

05/22/2023
by   Michael Hassid, et al.
0

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/01/2022

T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

In Spoken language understanding (SLU), a natural solution is concatenat...
research
07/03/2023

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Recently, there has been an increase in interest in evaluating large lan...
research
11/01/2021

PerSpeechNorm: A Persian Toolkit for Speech Processing Normalization

In general, speech processing models consist of a language model along w...
research
09/17/2023

Augmenting text for spoken language understanding with Large Language Models

Spoken semantic parsing (SSP) involves generating machine-comprehensible...
research
10/29/2020

How Many Pages? Paper Length Prediction from the Metadata

Being able to predict the length of a scientific paper may be helpful in...
research
09/04/2023

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

State-of-the-art text-to-speech (TTS) systems have utilized pretrained l...

Please sign up or login with your details

Forgot password? Click here to reset