CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

06/04/2020
by Sameer Khurana, et al.

More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting a language proceed by collecting audio data, followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers who are bilingual and trained in a high-resource language. As a result, it is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning that exploits the correlations between the two modalities, namely speech and its corresponding text translation. We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.
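
The retrieval objective described in the abstract can be illustrated with a small sketch. The abstract does not state which contrastive loss the authors use, so the code below substitutes a generic symmetric softmax (InfoNCE-style) loss over a batch of paired speech and translation embeddings. The function names, the cosine similarity, and the `temperature` value are all assumptions made for illustration, and the embeddings are plain Python lists rather than outputs of a real encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def contrastive_retrieval_loss(speech_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss (an assumed stand-in, not the paper's loss).

    speech_embs[i] and text_embs[i] are treated as a matched pair; every
    other item in the batch serves as a negative for retrieval.
    """
    n = len(speech_embs)
    # Pairwise similarity matrix, scaled by the (hypothetical) temperature.
    sims = [[cosine_similarity(s, t) / temperature for t in text_embs]
            for s in speech_embs]
    # Speech -> text retrieval: each row should score its own column highest.
    s2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Text -> speech retrieval: same computation over the columns.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)

# Toy batch: two speech embeddings and their translation embeddings.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_retrieval_loss(speech, aligned_text)
loss_swapped = contrastive_retrieval_loss(speech, swapped_text)
```

In this toy batch the loss is close to zero when each speech embedding matches its own translation and large when the translations are swapped, which is the signal a retrieval-trained encoder would be optimized against.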

research
05/17/2022

SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-leve...
research
06/22/2020

Self-Supervised Representations Improve End-to-End Speech Translation

End-to-end speech-to-text translation can provide a simpler and smaller ...
research
06/16/2020

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

Current methods for learning visually grounded language from videos ofte...
research
07/17/2021

Learning De-identified Representations of Prosody from Raw Audio

We propose a method for learning de-identified prosody representations f...
research
09/29/2021

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation

Decades of research has studied how language learning infants learn to d...
research
07/20/2023

MASR: Metadata Aware Speech Representation

In the recent years, speech representation learning is constructed prima...
