Improving Joint Speech-Text Representations Without Alignment

08/11/2023
by Cal Peyser, et al.

The last year has seen astonishing progress in text-prompted image generation, premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In automatic speech recognition (ASR), this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or by an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream word error rate (WER) in both large-parameter monolingual and multilingual systems.
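To make the idea of a consistency loss that "simply assumes the best alignment" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the speech encoder emits T frame embeddings, the text encoder emits U token embeddings with T >= U, that frames map to tokens monotonically with every token used at least once, and that the per-pair cost is squared Euclidean distance. The function names `squared_distance` and `best_alignment_loss` are hypothetical, and the dynamic program is one plausible way to minimize over alignments.

```python
from math import inf


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_alignment_loss(speech, text):
    """Mean per-frame cost of the cheapest monotonic frame-to-token alignment.

    speech: list of T frame embeddings (each a list of floats).
    text:   list of U token embeddings, with 1 <= U <= T.
    Assumption (not from the source): every frame is assigned to exactly one
    token, assignments never move backwards, and every token is used.
    """
    T, U = len(speech), len(text)
    if U == 0 or T < U:
        raise ValueError("need 1 <= len(text) <= len(speech)")

    # prev[u] = cheapest cost of aligning frames 0..t-1 with the last frame on token u.
    prev = [inf] * U
    prev[0] = squared_distance(speech[0], text[0])

    for t in range(1, T):
        cur = [inf] * U
        # At frame t, at most token index t can have been reached.
        for u in range(min(t + 1, U)):
            stay = prev[u]                           # frame t stays on token u
            advance = prev[u - 1] if u > 0 else inf  # frame t moves on to token u
            cur[u] = squared_distance(speech[t], text[u]) + min(stay, advance)
        prev = cur

    # The alignment must end on the final token; normalize by frame count.
    return prev[U - 1] / T
```

Under these assumptions, calling `best_alignment_loss` on five speech frames and two text tokens needs neither up-sampling of the text nor a separate alignment model: the minimization over alignments absorbs the length mismatch. A trainable version would compute the same recursion over framework tensors so that gradients flow through the selected alignment path.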

