Learning English with Peppa Pig

02/25/2022
by Mitja Nikolaus et al.

Attempts to computationally simulate the acquisition of spoken language via grounding in perception have a long tradition but have gained momentum in the past few years. Current neural approaches exploit associations between the spoken and visual modalities and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of view of ecological validity is the training data, which typically consists of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual world. In the real world the coupling between the linguistic and the visual modality is loose and often contains confounds in the form of correlations with non-semantic aspects of the speech signal. The current study is a first step towards simulating a naturalistic grounding scenario by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of naturalistic dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data, our model succeeds at learning aspects of the visual semantics of spoken language.
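The abstract describes a simple bi-modal architecture that embeds speech and video clips in a joint vector space. The sketch below illustrates one common way such a model can be set up: a speech encoder, a video encoder over pre-extracted frame features, and a symmetric contrastive objective that pulls matching speech/video pairs together. The encoder choices, feature dimensions, and InfoNCE-style loss are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed details, not the authors' code) of a bi-modal
# model that places speech and video in a joint embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, stride=2, padding=2)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mel):                 # mel: (batch, n_mels, time)
        x = F.relu(self.conv(mel))          # (batch, dim, time/2)
        _, h = self.rnn(x.transpose(1, 2))  # h: (1, batch, dim)
        return F.normalize(h[-1], dim=-1)   # unit-length speech embedding

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)

    def forward(self, frames):              # frames: (batch, n_frames, feat_dim)
        x = self.proj(frames).mean(dim=1)   # mean-pooled clip embedding
        return F.normalize(x, dim=-1)

def contrastive_loss(speech_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matching speech/video pairs lie on the diagonal."""
    logits = speech_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    speech, video = SpeechEncoder(), VideoEncoder()
    mel = torch.randn(8, 40, 200)       # batch of log-mel spectrograms
    frames = torch.randn(8, 16, 2048)   # batch of pre-extracted frame features
    print(contrastive_loss(speech(mel), video(frames)).item())
```

Trained this way, paired clips end up close in the shared space, which is what allows evaluation by retrieving video segments from spoken narration (and vice versa).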


Related research

Learning Word-Like Units from Joint Audio-Visual Analysis (01/25/2017)
Video-Guided Curriculum Learning for Spoken Video Grounding (09/01/2022)
Computational Induction of Prosodic Structure (12/15/2019)
Representations of language in a model of visually grounded speech signal (02/07/2017)
On the Contributions of Visual and Textual Supervision in Low-resource Semantic Speech Retrieval (04/24/2019)
On the difficulty of a distributional semantics of spoken language (03/23/2018)
CiwaGAN: Articulatory information exchange (09/14/2023)
