Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation

09/29/2021
by Khazar Khorrami, et al.

Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved through statistical learning between speech and concurrent, referentially ambiguous visual input. Such models can operate without prior linguistic knowledge, such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This raises the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could emerge as latent representations that support the mapping between speech and representations in other modalities, without the units ever being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies. We then explore LLH further through extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect the phonetic, syllabic, or lexical structure of the input speech, using an array of complementary evaluation metrics related to the linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated...
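The general recipe the abstract describes, embedding speech and co-occurring images into a shared space, training on matched versus mismatched pairs, and later probing the latent activations for linguistic structure, can be illustrated with a minimal sketch. The sketch below is not the paper's actual architecture: the encoder layouts, layer sizes, and the symmetric InfoNCE-style contrastive objective are all illustrative assumptions standing in for the several model variants the study compares.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """1-D CNN over log-mel frames -> utterance embedding (shapes assumed)."""
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )

    def forward(self, mels):          # mels: (B, n_mels, T)
        h = self.conv(mels)           # (B, dim, T'): the latent frames LLH cares about
        return h.mean(dim=2)          # temporal mean pooling -> (B, dim)

class ImageEncoder(nn.Module):
    """Linear projection of precomputed image features into the shared space."""
    def __init__(self, feat_dim=2048, dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)

    def forward(self, feats):         # feats: (B, feat_dim)
        return self.proj(feats)

def contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over a batch: the matched audio-image pair is the
    positive; every other pairing in the batch acts as a cross-situational
    distractor, so no unit-level supervision is ever given."""
    a = F.normalize(a, dim=1)
    v = F.normalize(v, dim=1)
    logits = a @ v.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-in data:
audio_enc, image_enc = AudioEncoder(), ImageEncoder()
mels, feats = torch.randn(8, 40, 200), torch.randn(8, 2048)
loss = contrastive_loss(audio_enc(mels), image_enc(feats))
```

Under LLH, the quantity of interest is not the loss but the intermediate activations (the pre-pooling `h` in `AudioEncoder`), which can afterwards be tested for phonetic, syllabic, or lexical selectivity, for example with linear probes or ABX-style discrimination, in the spirit of the evaluation metrics the abstract mentions.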


