SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

05/17/2022
by Sameer Khurana, et al.

We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous work on speech representation learning, which learns multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create the utterance-level multimodal multilingual speech encoder SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We demonstrate these applications on a range of cross-lingual text and speech translation retrieval tasks across multiple datasets.
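The abstract describes the framework only at a high level; the following is a minimal sketch (not the authors' implementation) of the core idea: pool XLS-R frame-level features into one utterance vector, project it into LaBSE's sentence-embedding space, and pull it toward the frozen LaBSE embedding of the transcript. The checkpoint names, mean pooling, linear projection, and cosine loss below are assumptions for illustration; the paper's pooling and loss details may differ.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model
from sentence_transformers import SentenceTransformer


class SamuXlsrSketch(nn.Module):
    """Hypothetical SAMU-XLSR-style model: XLS-R + pooling + projection."""

    def __init__(self, labse_dim: int = 768):
        super().__init__()
        # Frame-level multilingual speech encoder (XLS-R, 300M variant assumed).
        self.speech_encoder = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-xls-r-300m")
        hidden = self.speech_encoder.config.hidden_size
        # Project the pooled utterance vector into LaBSE's embedding space.
        self.proj = nn.Linear(hidden, labse_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frames = self.speech_encoder(waveform).last_hidden_state  # (B, T, H)
        pooled = frames.mean(dim=1)  # simple mean pooling (an assumption)
        return nn.functional.normalize(self.proj(pooled), dim=-1)


# Frozen text targets: LaBSE sentence embeddings of the transcripts.
labse = SentenceTransformer("sentence-transformers/LaBSE")


def training_loss(model: SamuXlsrSketch, waveform, transcripts) -> torch.Tensor:
    speech_emb = model(waveform)  # (B, D), unit-normalized
    with torch.no_grad():
        text_emb = torch.from_numpy(labse.encode(transcripts))  # (B, D)
        text_emb = nn.functional.normalize(text_emb, dim=-1)
    # Pull each speech embedding toward its transcript's LaBSE embedding
    # by maximizing their cosine similarity.
    return (1.0 - (speech_emb * text_emb).sum(dim=-1)).mean()
```

Under this formulation, both retrieval tasks reduce to cosine nearest-neighbor search in the shared space: speech queries are matched against LaBSE text embeddings for speech-to-text translation retrieval, or against other SAMU-XLSR speech embeddings for speech-to-speech translation retrieval.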


