Mu^2SLAM: Multitask, Multilingual Speech and Language Models

12/19/2022
by Yong Cheng et al.

We present Mu^2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu^2SLAM trains the speech-text model with a T5-style sequence-to-sequence masked denoising objective on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu^2SLAM establishes a new state of the art for models trained on public datasets, improving over the previous best by 1.9 BLEU points on xx-en translation and by 1.1 BLEU points on en-xx translation. On VoxPopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, approaching the performance of mT5 models of comparable capacity on XNLI and TyDi QA, and paving the way towards a single model for all speech and text understanding tasks.
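The core of the recipe is that quantized speech units can be denoised exactly like text tokens: spans are masked out, the encoder sees the corrupted sequence, and the decoder reconstructs the masked spans. The minimal Python sketch below illustrates T5-style span corruption over a generic token sequence; the function name, sentinel vocabulary and hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
import random

# T5-style sentinel tokens; in a real model these are ids in the vocabulary.
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]

def span_corrupt(tokens, mask_rate=0.15, mean_span_len=3, rng=random):
    """Replace random contiguous spans of `tokens` with sentinel markers.

    Returns (encoder_input, decoder_target) in the T5 convention: the
    encoder sees the sequence with each masked span collapsed to a single
    sentinel, and the decoder reconstructs each masked span prefixed by
    its sentinel. `tokens` may be text subwords or quantized speech
    units; Mu^2SLAM applies the same denoising objective to both.
    This is an illustrative sketch, not the paper's exact masking code.
    """
    n = len(tokens)
    num_to_mask = max(1, int(n * mask_rate))
    masked = [False] * n
    # Draw random span starts and lengths until enough tokens are masked.
    while sum(masked) < num_to_mask:
        span = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(n)
        for i in range(start, min(n, start + span)):
            masked[i] = True

    encoder_input, decoder_target = [], []
    sentinel_id, i = 0, 0
    while i < n:
        if masked[i]:
            encoder_input.append(SENTINELS[sentinel_id])
            decoder_target.append(SENTINELS[sentinel_id])
            while i < n and masked[i]:
                decoder_target.append(tokens[i])
                i += 1
            sentinel_id += 1
        else:
            encoder_input.append(tokens[i])
            i += 1
    return encoder_input, decoder_target

# Example: the same routine works for text subwords or discrete speech units.
enc, tgt = span_corrupt("a multilingual speech text model pre trained jointly".split())
print(enc)
print(tgt)
```

In the paper's setup, this seq2seq denoising loss on the decoder is combined with an MLM loss on the encoder, and the supervised ASR/AST/MT pairs are mixed in to align representations across languages and modalities.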


