Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

05/19/2023
by   Puyuan Peng, et al.

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the ZeroSpeech Challenge, in some cases beating the previous state-of-the-art.
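To make the segmentation idea concrete, below is a minimal, illustrative sketch of boundary prediction via a normalized minimum cut over a frame-level self-similarity matrix: frames within a syllable tend to be mutually similar, so good boundaries are cuts with little cross-cut similarity relative to each side's total similarity. The function names and the `min_len` / `max_ncut` parameters are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def self_similarity(feats):
    """Cosine self-similarity matrix of frame features (shape T x D)."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T

def segment(sim, s, e, min_len=3, max_ncut=0.5, bounds=None):
    """Recursively bisect the span [s, e) at the lowest normalized-cut score.

    Recursion stops when the span is too short or no cut is "clean" enough
    (score above max_ncut), leaving the span as one segment.
    """
    if bounds is None:
        bounds = []
    if e - s < 2 * min_len:
        return bounds
    best_t, best_score = None, np.inf
    for t in range(s + min_len, e - min_len + 1):
        cross = sim[s:t, t:e].sum()           # similarity across the cut
        assoc_a = sim[s:t, s:e].sum() + 1e-8  # left side's total similarity
        assoc_b = sim[t:e, s:e].sum() + 1e-8  # right side's total similarity
        score = cross / assoc_a + cross / assoc_b
        if score < best_score:
            best_t, best_score = t, score
    if best_t is None or best_score > max_ncut:
        return bounds
    segment(sim, s, best_t, min_len, max_ncut, bounds)
    bounds.append(best_t)
    segment(sim, best_t, e, min_len, max_ncut, bounds)
    return bounds
```

On toy features with two internally uniform blocks (e.g. ten frames near one direction followed by ten near an orthogonal one), the recursion finds the single boundary between them and stops, since any cut inside a uniform block scores poorly. In a full pipeline the recovered segments would then be pooled and clustered to group recurring syllables.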

