Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

05/18/2018
by Yu-An Chung, et al.

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform spoken word classification and translation, and the results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.
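The abstract does not say which refinement procedure follows the adversarial step. As a minimal sketch only, the example below assumes an orthogonal Procrustes refinement, a common choice in the unsupervised cross-lingual embedding literature, followed by cosine nearest-neighbor lookup as one possible way to do spoken word classification. The function names, the synthetic data, and the assumption that row i of the speech matrix is already paired with row i of the text matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np

def procrustes_align(speech_vecs, text_vecs):
    """Return the orthogonal W minimizing ||speech_vecs @ W - text_vecs||_F.

    Assumes row i of speech_vecs is (pseudo-)paired with row i of text_vecs,
    e.g. pairs induced by a previous adversarial alignment step.
    """
    u, _, vt = np.linalg.svd(speech_vecs.T @ text_vecs)
    return u @ vt

def nearest_text_index(speech_vec, w, text_vecs):
    """Map one speech embedding into text space; return the closest row by cosine."""
    mapped = speech_vec @ w
    sims = (text_vecs @ mapped) / (
        np.linalg.norm(text_vecs, axis=1) * np.linalg.norm(mapped)
    )
    return int(np.argmax(sims))

# Synthetic stand-in data: the "speech" space is a rotated copy of the "text" space.
rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
speech_vecs = text_vecs @ true_rotation.T

w = procrustes_align(speech_vecs, text_vecs)
predicted = nearest_text_index(speech_vecs[3], w, text_vecs)
```

On this noise-free toy data the recovered map matches the rotation used to generate it, so each mapped speech vector lands on its own text vector. Real speech and text embeddings are only approximately related by a linear map, so retrieval there would be noisier.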

research
11/04/2018

Towards Unsupervised Speech-to-Text Translation

We present a framework for building speech-to-text translation (ST) syst...
research
06/13/2023

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Lyrics alignment gained considerable attention in recent years. State-of...
research
01/18/2018

An Iterative Closest Point Method for Unsupervised Word Translation

Unsupervised word translation from non-parallel inter-lingual corpora ha...
research
03/25/2019

Aligning Vector-spaces with Noisy Supervised Lexicons

The problem of learning to translate between two vector spaces given a s...
research
03/22/2023

AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages

The advancement of speech technologies has been remarkable, yet its inte...
research
03/15/2023

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

Past work on unsupervised parsing is constrained to written form. In thi...
