
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

by Grzegorz Chrupała, et al.

This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.



A Survey of the Usages of Deep Learning in Natural Language Processing

Over the last several years, the field of natural language processing ha...

An Overview of Natural Language State Representation for Reinforcement Learning

A suitable state representation is a fundamental part of the learning pr...

Representations of language in a model of visually grounded speech signal

We present a visually grounded model of speech perception which projects...

Symbolic inductive bias for visually grounded learning of spoken language

A widespread approach to processing spoken language is to first automati...

An Attentive Survey of Attention Models

Attention Model has now become an important concept in neural networks t...

Deep Spoken Keyword Spotting: An Overview

Spoken keyword spotting (KWS) deals with the identification of keywords ...

Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations

A major challenge in visually grounded language generation is to build r...