
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

by Lieke Gelderloos, et al.

We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners, who must discover both structure and meaning from noisy, ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
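The kind of architecture described above can be sketched in plain NumPy under simple assumptions: embedded phonemes are fed through a stack of gated recurrent (GRU) layers, and the final hidden state of the top layer is linearly projected to a visual feature vector. All dimensions, initializations, and the final-state readout below are illustrative assumptions, not the paper's actual configuration or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRULayer:
    """One gated recurrent layer (standard GRU equations, toy sizes)."""
    def __init__(self, input_dim, hidden_dim):
        s = 0.1  # small random init, purely illustrative
        self.Wz = rng.normal(0, s, (hidden_dim, input_dim))
        self.Uz = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wr = rng.normal(0, s, (hidden_dim, input_dim))
        self.Ur = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wh = rng.normal(0, s, (hidden_dim, input_dim))
        self.Uh = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, xs):
        """Map a sequence of input vectors to a sequence of hidden states."""
        h = np.zeros(self.hidden_dim)
        states = []
        for x in xs:
            z = sigmoid(self.Wz @ x + self.Uz @ h)          # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h)          # reset gate
            h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1 - z) * h + z * h_tilde
            states.append(h)
        return states

def predict_visual_features(phonemes, embed, layers, W_out):
    """Run a phoneme-ID sequence through the stack; read out the top
    layer's final state as the predicted visual feature vector."""
    xs = [embed[p] for p in phonemes]
    for layer in layers:
        xs = layer.forward(xs)
    return W_out @ xs[-1]

# Hypothetical toy setup: 10 phoneme types, two stacked layers,
# 32-dimensional "visual" targets (real image features are far larger).
N_PHONEMES, EMB, HID, VIS = 10, 8, 16, 32
embed = rng.normal(0, 0.1, (N_PHONEMES, EMB))
layers = [GRULayer(EMB, HID), GRULayer(HID, HID)]
W_out = rng.normal(0, 0.1, (VIS, HID))

features = predict_visual_features([3, 1, 4, 1, 5], embed, layers, W_out)
print(features.shape)  # (32,)
```

In the paper's setting such a stack would be trained end-to-end so that the predicted vector matches features extracted from the paired image; the layer-wise analysis in the abstract then probes the intermediate `states` of each layer for form versus meaning.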




Related papers

- Representations of language in a model of visually grounded speech signal
- iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
- Pragmatics in Grounded Language Learning: Phenomena, Tasks, and Modeling Approaches
- Encoding of phonology in a recurrent neural model of grounded speech
- Where to put the Image in an Image Caption Generator
- Teaching Machines to Code: Neural Markup Generation with Visual Attention
- What is Learned in Visually Grounded Neural Syntax Acquisition