Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

05/19/2023
by Tianshu Yu, et al.

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored to one or two specific tasks and do not generalize to a wide range of speech-text tasks. In addition, existing speech-text pre-training methods do not exploit the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first speech-text dialog pre-training model. Concretely, to account for the temporal nature of the speech modality, we design a novel temporal position prediction task to capture speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
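To make the two pre-training objectives concrete, the following is a minimal sketch of how their training targets could be constructed. It is not the SPECTRA implementation: the abstract does not specify the loss functions or the sampling scheme, so the L1 regression loss, the normalization of timestamps by utterance duration, the random sampling of negative responses, and all names (`WordSpan`, `temporal_targets`, `l1_position_loss`, `response_selection_examples`) are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class WordSpan:
    """A textual word with its (hypothetical) forced-alignment times in seconds."""
    word: str
    start: float
    end: float


def temporal_targets(spans, duration):
    """Normalize each word's start and end time to [0, 1] by the utterance duration.

    Assumption: targets are normalized; the abstract only says the task
    predicts the start and end time of each word in the waveform.
    """
    return [(s.start / duration, s.end / duration) for s in spans]


def l1_position_loss(predicted, targets):
    """Mean absolute error over all start and end predictions (assumed loss)."""
    total = sum(
        abs(ps - ts) + abs(pe - te)
        for (ps, pe), (ts, te) in zip(predicted, targets)
    )
    return total / (2 * len(targets))


def response_selection_examples(dialog, candidate_pool, rng):
    """Build one positive and one negative (context, response, label) triple.

    Assumption: the negative response is drawn at random from other dialogs.
    """
    context, true_response = dialog[:-1], dialog[-1]
    negatives = [c for c in candidate_pool if c != true_response]
    return [
        (context, true_response, 1),
        (context, rng.choice(negatives), 0),
    ]


# Temporal position prediction on a two-word, 2-second utterance.
spans = [WordSpan("hello", 0.0, 0.5), WordSpan("there", 0.5, 1.0)]
targets = temporal_targets(spans, duration=2.0)
predicted = [(0.0, 0.25), (0.25, 0.75)]  # stand-in for model outputs
loss = l1_position_loss(predicted, targets)

# Response selection on a three-turn dialog.
dialog = ["hi", "how are you?", "fine, thanks"]
pool = ["fine, thanks", "turn left at the corner", "it costs five dollars"]
examples = response_selection_examples(dialog, pool, random.Random(0))
```

In this sketch the first objective ties each word to a time span in the waveform, while the second asks whether a candidate response fits the preceding turns. In a speech-text model each turn would carry both audio and transcript; plain strings stand in for both here.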


