Understanding spoken language is one of the key capabilities of intelligent systems which need to interact with humans. Applications include personal assistants, search engines, vehicle navigation systems and many others. The standard approach to understanding spoken language, both in industry and in research, has been to decompose the problem into two components arranged in a pipeline: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). The audio signal representing a spoken utterance is first transcribed into written text, which is subsequently processed to extract some semantic representation of the utterance. Recent works have proposed to learn semantic embeddings of spoken language by using photographic images of everyday situations matched with their spoken captions, without an intermediate transcription step (harwath2016unsupervised; chrupala2017representations). The weak and noisy supervision in these approaches is closer to how humans learn to understand speech, by grounding it in perception, and is thus more useful as a cognitive model. It can also have some practical advantages: in certain circumstances it may be easier to find or collect speech associated with images than transcribed speech, for example when dealing with languages whose speakers are illiterate, or with languages that have no standard writing system (note that even some languages with many millions of speakers, like Cantonese, may not have a standardized writing system). On the other hand, the learning problem in this type of framework is less constrained, and harder, than standard ASR.
In order to alleviate this shortcoming, we propose to use multitask learning (MTL) and exploit transcribed speech within the end-to-end visually-grounded setting, and thus combine some features of both the pipeline and end-to-end approaches. We describe a three-task architecture which combines the main objective of matching speech with images with two auxiliary objectives: matching speech with text, and matching text with images.
The plain end-to-end speech/image matching task, modeled via standard architectures such as recurrent neural networks, lacks a language-specific learning bias. This type of model may discover in the course of learning that speech can be represented as a sequence of symbols (such as, for example, phonemes or graphemes), but it is in no way predisposed to make this discovery. Human learners may be more efficient at least in part thanks to their innate inductive bias whereby they assume that language is symbolic. They presumably acquired such bias via the process of evolution by natural selection, or possibly by cultural evolution (Kirby5241). In the context of machine learning, inductive bias can instead be injected via multi-task learning, where supervision from the secondary task guides the model towards appropriately biased representations.
Specifically, our motivation for the speech/text task is to encourage the model to learn speech representations which are correlated with the encoding of spoken language as a sequence of characters. Additionally, and for completeness, we also consider a second auxiliary task matching text to images.
Our contribution consists in formulating and answering the following questions:
Do the auxiliary tasks improve the main speech/image task? The speech/text task helps but we have no evidence of the text/image task improving performance.
If so, is this mainly because MTL allows us to exploit extra data, or because the additional task injects an appropriate inductive bias into the model? The inductive bias is key to the performance gains of MTL, while extra data makes no impact.
Which parameters should be shared between tasks and which should be task specific? Best performance is achieved by sharing only the lower layers of the speech encoder.
What are the specific effects of the symbolic inductive bias on the learned representations? The speech/text task contributes to making the encoded speech more speaker-invariant, and more strongly correlated with the written form of the utterances.
2 Related work
2.1 Visually grounded semantic embeddings of spoken language
The most relevant strand of related work is on visually-grounded learning of (spoken) language. It dates back at least to Roy2002113, but has recently attracted further interest due to better-performing modeling tools based on neural networks.
harwath2015deep collect spoken descriptions for the Flick8K captioned image dataset and present a model which is able to map pre-segmented spoken words to aspects of visual context. harwath2016unsupervised describe a larger dataset of images paired with spoken captions (Places Audio Caption Corpus) and present an architecture that learns to project images and unsegmented spoken captions to the same embedding space. The sentence representation is obtained by feeding the spectrogram to a convolutional network. Further elaborations on this setting include P17-1047, which shows a clustering-based method to identify grounded words in the speech-image pairs, and harwath2018jointly
which constructs a three-dimensional tensor encoding affinities between image regions and speech segments.
The work of chrupala2017representations is similar in that it exploits datasets of images with spoken captions, but their grounded speech model is based around multi-layer Recurrent Highway Networks, and focuses on quantitative analyses of the learned representations. They show that the encoding of meaning tends to become richer in higher layers, whereas encoding of form tends to initially increase and then stay constant or decrease. K17-1037 further analyze the representations of the same model and show that phonological form is reliably encoded in the lower recurrent layers of the network but becomes substantially attenuated in the higher layers.
drexler2017analysis also analyze the representations of a visually grounded speech model, with a view to using such representations for unsupervised speech recognition, and show that they contain more linguistic and less speaker information than filterbank features.
Other work uses images as a pivot to learn to associate textual labels with spoken utterances, by mapping utterances and images into a joint semantic space. After labeling the images with an object classifier, these labels can be further associated with utterances, providing a bag-of-words representation of spoken language which can be useful in speech retrieval.
2.2 Multi-task learning for speech and language
The concept of multi-task learning (MTL) was introduced by caruana1997multitask. Neural architectures widely used in the fields of speech and language processing make it easy to define parameter-sharing architectures and exploit MTL, and thus there has been a recent spurt of reports on its impact.
Within Natural Language Processing (NLP), 44928 explore sharing encoders and decoders in a sequence-to-sequence architecture for translation, syntactic parsing, and image captioning, and show gains on some configurations. E17-2026 investigate which particular pairs of NLP tasks lead to gains, concluding that learning curves and label entropy of the tasks may be used as predictors. mccann2018natural propose a 10-task NLP challenge, and a single MTL model which performs reasonably well on all tasks.
P16-2038 show that which parameters are shared in a multi-task architecture matters a lot: they find that when sharing parameters between syntactic chunking or supertagging and POS tagging as an auxiliary task, it was consistently better to only share the lower-layers of the model. Relatedly, D17-1206 propose a method of training NLP tasks at multiple levels of complexity by growing the depth of the model to solve increasingly more difficult tasks. Swayamdipta2018SyntacticSF use similar ideas and show that syntactic information can be incorporated in a semantic task with MTL, using auxiliary syntactic tasks without building full-fledged syntactic structure at prediction time.
MTL can lead to a bewildering number of choices regarding which tasks to combine, which parameters to share and how to schedule and weight the tasks. Some recent works have suggested specific approaches to deal with this complexity: ruder2017learning propose to learn from data which parameters to share in MTL with sluice networks and show some gains on NLP tasks. kiperwasser2018scheduled investigate how to interleave learning syntax and translation and how to schedule these tasks.
Several works show that exploiting MTL via the use of multiple language versions of the same or comparable data leads to performance gains (e.g. lee2017fully; Q17-1024; de2018parameter). gella2017image and kadar2018lessons learn visual semantic embeddings from textual-visual datasets and show gains from additional languages which reuse the same encoder. kadar2018lessons additionally show that an extra objective linking the languages directly rather than only via the visual modality provides additional performance gains. In the context of audio-visual data, harwath2018vision applies a type of MTL in the setting where there are images paired with descriptions in English and Hindi. They project the images, English speech and Hindi speech into a joint semantic space, and show that training on multiple tasks matching both languages to images works better compared to only using a single monolingual task.
MTL has also recently seen some success in speech processing. Similar to what we see in machine translation, in ASR parameter sharing between different languages is also beneficial (6639348). More recently, DBLP:journals/corr/abs-1802-07420 show that exploiting this effect is especially useful for low-resource languages.
6639012 apply MTL for phone recognition with three lower-level auxiliary tasks and show noticeable reductions in error rates. DBLP:conf/interspeech/ToshniwalTLL17 use MTL for conversational speech recognition with lower-level tasks (e.g. phoneme recognition) in an encoder-decoder model for direct character transcription. 7953071 learn to align utterances with phonetic transcriptions in a lower layer and graphemic transcriptions in the final layer, exploiting again the relation between task level of complexity and levels of neural architecture in a MTL setting. They also show a benefit of sharing model parameters between different varieties of the same language, specifically US, British, Indian and Australian English. DBLP:journals/corr/abs-1710-08377 demonstrate the effectiveness of transfer from generic audio classification to speech command recognition, which can also be considered a particular instance of MTL.
How our work fits in.
The current paper uses an intuition also present in several of the works mentioned above: namely that an end-to-end model which needs to induce several levels of intermediate latent representations should be guided to find useful ones by including auxiliary prediction tasks at the intermediate layers. These auxiliary prediction tasks typically use lower-level linguistically-motivated structures such as phonemes for end-to-end ASR, or syntactic trees for semantic parsing.
The present study extends this setting to a full speech-to-semantics setup: the main task is to take spoken language as input and learn a semantic representation based on feedback from the visual modality, while an ASR-like task (speech/text matching) is merely auxiliary. The lower-level linguistic structures in our case are the sequences of phoneme-like units approximated by the written form of the language.
The modeling framework uses a multi-task setup. The core model is a three-task architecture depicted in Figure 1: there are three encoders, one for each modality: speech, image, and text. Each modality has a shared encoder which works directly on the input modality, and two specialized encoders which take as input the encoded data from the shared encoder. The three tasks correspond to three losses (depicted with circles in the figure): each loss works with a pair of modalities and attempts to minimize the distance between matching encoded items, while maximizing the distance between mismatching ones. For a pair of modalities with encoded objects u and i, the loss is defined as follows:
∑_{u,i} ( ∑_{u'} max[0, α + d(u,i) − d(u',i)] + ∑_{i'} max[0, α + d(u,i) − d(u,i')] )

where u and i are matching objects (for example an utterance and a matching image), u' and i' are mismatched objects within a batch, and d is the cosine distance between encoded objects.
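As a concrete illustration, this margin-based loss can be sketched in a few lines of numpy. This is a simplified sketch, not the training code used in the experiments: the margin value and the exclusion of the diagonal terms (where the "mismatched" object equals the matching one) are illustrative choices.

```python
import numpy as np

def cosine_distance(U, I):
    """Pairwise cosine distances between row-wise encodings."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    return 1.0 - U @ I.T

def matching_loss(U, I, alpha=0.2):
    """Margin loss: the distance d(u, i) for matching pairs (the diagonal)
    should be smaller than d(u', i) and d(u, i') for mismatched pairs,
    by at least the margin alpha."""
    D = cosine_distance(U, I)      # D[j, k] = d(u_j, i_k)
    pos = np.diag(D)               # distances of matching pairs
    # alpha + d(u,i) - d(u',i): vary the utterance u' (rows) per image
    loss_u = np.maximum(0.0, alpha + pos[None, :] - D)
    # alpha + d(u,i) - d(u,i'): vary the image i' (columns) per utterance
    loss_i = np.maximum(0.0, alpha + pos[:, None] - D)
    # exclude the degenerate terms where u' = u or i' = i
    np.fill_diagonal(loss_u, 0.0)
    np.fill_diagonal(loss_i, 0.0)
    return loss_u.sum() + loss_i.sum()
```

With well-separated embeddings (e.g. orthogonal matching pairs) the loss is zero; when all encodings collapse to a single point, every mismatched pair violates the margin and contributes exactly alpha.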
The speech/image part of the architecture is based on the grounded speech model from chrupala2017representations, with the main difference being that these authors used Recurrent Highway Networks (pmlr-v70-zilly17a) for the recurrent layers, while we chose the simpler Gated Recurrent Unit networks (chung2014empirical), because they have optimized low-level CUDA support, which makes them much faster to run and enables us to carry out an at least somewhat comprehensive set of experiments.
3.2 Image Encoders
3.3 Speech Encoders
The shared encoder s consists of a 1-dimensional convolutional layer which subsamples the input, followed by a stack of recurrent layers. The modality-specific encoders s2t and s2i consist of a stack of recurrent layers, followed by an attention operator. The shared encoder is defined as follows:

Enc_s(x) = GRU_l(Conv_{k,d,z}(x))

where Conv_{k,d,z} is a convolutional layer with kernel size k, d channels, and stride z, and GRU_l is a stack of l GRU layers. An encoder of modality x is defined as

Enc_x(h) = unit(Attn(GRU_l(h)))

where Attn is the attention operator and unit is L2-normalization. Note that for the case l = 0, GRU_0 is simply the identity function. The attention operator computes a weighted sum of the RNN activations h_t at all timesteps:

Attn(h) = ∑_t α_t h_t

where the weights α_t are determined by an MLP with learned parameters U and W, and passed through the timewise softmax function:

α_t = exp(U tanh(W h_t)) / ∑_{t'} exp(U tanh(W h_{t'}))
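The attention pooling and the normalization step can be sketched in numpy as follows. This is an illustrative sketch only: the shapes assumed for the MLP parameters W and U are assumptions, and the recurrent and convolutional layers are omitted.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(H, W, U):
    """Timewise attention pooling over RNN states H of shape (T, d).
    W of shape (d, k) and U of shape (k,) are the learned MLP parameters;
    returns a single vector of shape (d,)."""
    scores = np.tanh(H @ W) @ U        # (T,) unnormalized attention weights
    alpha = softmax(scores, axis=0)    # timewise softmax
    return alpha @ H                   # weighted sum of activations

def unit(v):
    """L2-normalization of a vector."""
    return v / np.linalg.norm(v)
```

With zero-initialized parameters, the attention weights are uniform and the operator reduces to mean pooling over timesteps.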
3.4 Text Encoders
The text encoders are defined in the same way as the speech encoders, with the only difference being that the convolutional layer is replaced by an embedding layer, i.e. a lookup table mapping characters to embedding vectors.
The model is trained by alternating between the tasks, and updating the parameters of each task in turn. Note that the input data for the three tasks can be the same, but can also be partly or completely disjoint. We report on two conditions:
Joint: all tasks use the same data;
Disjoint: the data for the Speech/Text task is disjoint from the data for the other two tasks.
We consider the disjoint condition somewhat more realistic, in that it is easier to find separate datasets for each pair of modalities than it is to find a single dataset with all three modalities. However, the main reason for including both conditions is that it allows us to disentangle via which mechanism MTL contributes: by enabling the use of extra data, or by enforcing an inductive bias.
3.6 Architecture variants
There is a multitude of ways in which the details of the core architecture can be varied. In order to reduce them to a manageable number we made the following choices:
Keep the image encoder simple and fixed.
Keep the architecture of the encoders fixed, and only vary encoder depth and the degree of sharing.
In addition to variants of the full three-task model, we also have single-task and two-task baselines which are the three-task model with the speech/text and text/image tasks completely ablated, or with only the text/image task ablated. Note that we do not include a condition with only the speech/text task ablated, as the two remaining tasks do not share any learnable parameters (since I is fixed).
3.7 Evaluation metrics
Below we introduce metrics evaluating performance on the image retrieval task, as well as additional analytical metrics which quantify some aspects of the internal representation learned by the encoders.
Evaluating image retrieval
In order to evaluate how well the main speech/image task performs we report the recall at 10 (R@10) and median rank (Medr) for the speech/image task: utterances in the development set are encoded via s2i and images via i2s. For each utterance the images are ranked in order of cosine distance; R@10 counts the mean proportion of correct images among the top 10 ranked images, while Medr gives the median rank of the correct image (where the correct image is the one originally paired with the utterance).
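Both metrics can be computed from a matrix of pairwise distances, sketched below. This is an illustrative sketch assuming a one-to-one pairing between utterances and images; it ignores ties and the fact that each image in the actual dataset has several captions.

```python
import numpy as np

def retrieval_metrics(D):
    """R@10 and median rank from a distance matrix D, where D[j, k] is the
    cosine distance between utterance j and image k, and image j is the
    correct match for utterance j."""
    ranks = []
    for j in range(D.shape[0]):
        order = np.argsort(D[j])                           # images by distance
        ranks.append(int(np.where(order == j)[0][0]) + 1)  # 1-based rank
    ranks = np.array(ranks)
    r_at_10 = float(np.mean(ranks <= 10))  # fraction of correct images in top 10
    medr = float(np.median(ranks))         # median rank of the correct image
    return r_at_10, medr
```

For a distance matrix where each utterance is closest to its own image, R@10 is 1.0 and the median rank is 1.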
Invariance to speaker
We measure how invariant the utterance encoding is to the identity of the speaker; in principle it is expected and desirable that the utterance encoding captures the meaning of the spoken language rather than other aspects of it such as who spoke it or other features irrelevant to the task. To quantify this invariance we report the accuracy of an L2-penalized logistic regression model on the task of decoding the identity of the speaker from the output of the s2i encoder. The logistic model is trained on one portion of the development data and tested on the held-out remainder.
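This probing classifier can be sketched with scikit-learn. The split proportion and regularization strength below are assumptions for illustration; the exact values used in the experiments are not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def speaker_id_accuracy(encodings, speakers, seed=0):
    """Train an L2-penalized logistic regression to decode speaker identity
    from utterance encodings; return held-out accuracy. Higher accuracy
    means the representations are *less* speaker-invariant."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        encodings, speakers, test_size=0.5, random_state=seed)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```

On encodings that cleanly separate speakers the probe reaches near-perfect accuracy; a speaker-invariant encoder pushes it towards chance level.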
Representational Similarity Analysis (kriegeskorte2008representational)
gauges the correlation between two sets of pairwise similarity measurements. Here we use it to quantify the correlation of the learned representation space with the written text space and with the image space. For the encoder representations, the pairwise similarities between utterances are given by the cosine similarities. For the written form, the similarities are the inverse of the normalized Levenshtein distance between the character sequences encoding each pair of utterances:

sim(a, b) = 1 − lev(a, b) / max(|a|, |b|)

where lev is the Levenshtein distance and |·| is string length.
We compute the Pearson correlation coefficient between two similarity matrices on the upper triangulars of each matrix, excluding the diagonal.
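The RSA computation can be sketched as follows. One assumption to flag: normalizing the Levenshtein distance by the length of the longer string is an illustrative choice, as the exact normalization is not spelled out above.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Inverse of the Levenshtein distance, normalized by the longer length."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def rsa(S1, S2):
    """Pearson correlation between the upper triangulars (excluding the
    diagonal) of two pairwise-similarity matrices."""
    iu = np.triu_indices_from(S1, k=1)
    return np.corrcoef(S1[iu], S2[iu])[0, 1]
```

Identical similarity structures give an RSA score of 1; unrelated structures give a score near 0.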
3.8 Experimental settings
The speech/image and text/image tasks are always trained on the Flickr8K Audio Caption Corpus (harwath2016unsupervised), which is based on the original Flickr8K dataset (hodosh2013framing). Flickr8K consists of 8,000 photographic images depicting everyday situations. Each image is accompanied by five brief English descriptions produced by crowd workers. Flickr8K Audio Caption Corpus enriches this data with spoken descriptions, read aloud and recorded by crowd workers. One thousand images are held out for validation, and another one thousand for the test set, using the splits provided by karpathy2015deep. In the joint condition the speech/text task is also trained on this data.
In the disjoint condition, we train the speech/text task on the LibriSpeech dataset (7178964), which consists of approximately 1,000 hours of read English speech derived from audiobooks. There are 291,630 sentences in the corpus, of which 1,000 are held out for validation.
We preprocess the audio by extracting 12-dimensional mel-frequency cepstral coefficients (MFCC) plus log of the total energy. We use 25 millisecond windows, sampled every 10 milliseconds. The shared image encoder is fixed and consists of the 4096-dimensional activations of the pre-classification layer of VGG-16 (simonyan2014very) pre-trained on ImageNet (Russakovsky2015).
We use a simple round-robin training scheme: we alternate between tasks, and for each task update the parameters of that task as well as the shared parameters based on supervision from one batch of data. The data ordering for each task is independent, both in the joint and disjoint condition: for each epoch we reshuffle the dataset associated to each task and iterate through the batches until the smallest dataset runs out. This procedure makes sure that the only difference between the joint and disjoint conditions is the actual data and not other aspects of training.
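The round-robin scheme can be sketched as follows. This is a schematic sketch: `update(task, batch)` stands in for one gradient step on the given task's loss with respect to its task-specific and shared parameters.

```python
import random

def round_robin_epoch(datasets, batch_size, update):
    """One epoch of round-robin MTL: reshuffle each task's dataset, then
    alternate between tasks, drawing one batch per task per turn, until
    the smallest dataset runs out."""
    iters = {}
    for task, data in datasets.items():
        data = list(data)
        random.shuffle(data)  # independent reshuffling per task
        iters[task] = iter([data[i:i + batch_size]
                            for i in range(0, len(data), batch_size)])
    while True:
        for task in datasets:
            batch = next(iters[task], None)
            if batch is None:      # smallest dataset exhausted: end of epoch
                return
            update(task, batch)    # one step on task-specific + shared params
```

Because iteration stops when the smallest dataset is exhausted, the number of updates per epoch is the same in the joint and disjoint conditions, so the only difference between them is the data itself.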
We will release the code to reproduce our results and analyses as open source.
Table 1 shows the evaluation results on validation data, on the image retrieval task of 13 configurations of the model, including three versions with one or two tasks ablated.
Table 2 shows the results on the test set with the 1-task baseline model and the best performing configuration compared to previously reported results on this dataset. As can be seen, the baseline model is somewhat worse than the best previously reported result on this data, while the 3-task model is much better.
Below we discuss and interpret the patterns in performance on image retrieval as measured by Recall@10 and median rank.
Impact of tasks
The most striking result is the large gap in performance between the 1-task condition (row 1) and most of the other rows. Comparing row 1 versus rows 2 and 3 we see that adding the speech/text task leads to a substantial improvement. However, comparing rows 2 and 3 versus rows 6 and 11, the addition of the text/image task does not seem to have a major impact on performance, at least to the extent that can be gleaned from the experiments we carried out. It is possible that with more effort put into engineering this component of the model we would see a better result.
Role of data vs inductive bias
The other major finding is that whether we use the same or different data to train the main and auxiliary task has overall little impact: this is indicated by relatively small differences between configurations in the joint vs disjoint condition. The differences that are there tend to favor the joint setting. This lends support to the conclusion that the speech/text auxiliary task contributes to improved performance on the main task via a strong inductive bias rather than merely via enabling the use of extra data. This is in contrast to many other applications of MTL.
Impact of parameter sharing design
The third important effect is about how parameters between the tasks are shared, specifically how the shared and task-specific parts of the speech encoder are apportioned. The configuration with maximum sharing of parameters among the tasks (rows 8 and 13) performs poorly compared to sharing only the lower layers of the encoders for speech and text (i.e. rows 6 and 11). Additionally, we see that the inclusion of a text-specific speech encoder s2t degrades performance: compare for example row 6 to 7, and row 11 to 12. Thus it is best to have a shared speech encoder whose output is directly used by the speech/text task, while the speech/image task carries out further transformations of the input via an image-specific speech encoder s2i. We can interpret this as the MTL emulating a pipeline architecture to some extent: direct connection of the speech/text task to the shared encoder forces it to come up with a representation closely correlated with a written transcription, and then the image-specific speech encoder takes this as input and maps it to a more meaning-related representation, useful for the speech/image task.
In addition to the above patterns of performance on image retrieval we now address our further research questions by investigating selected aspects of the encoder representations.
Figure 2 shows the relation between Recall@10 and the accuracy of speaker identification from the activation patterns of the output of encoder s2i. Configurations with more speaker invariance (i.e. lower accuracy) tend to perform better on the speech/image
retrieval task, but this is a noisy relation: speaker identification accuracy accounts for only 10% of the total variance in recall. Specifically, low accuracy on speaker ID implies high recall, but some configurations which have high speaker ID accuracy can nevertheless have relatively high recall.
As seen in Figure 2, the depth of the s2i encoder has an impact on speaker identification accuracy, while the type of the s2t encoder is moderately associated with speaker ID accuracy and also moderately related to recall.
Quantitatively we can measure the association between a variable x and an outcome y, while controlling for other variables, by using the coefficient of partial determination, i.e. the relative reduction in error caused by including variable x in a linear model for y:

CPD(x, y) = (SSE_{−x} − SSE) / SSE_{−x}

where SSE is the sum squared error of the model with all variables, and SSE_{−x} is the sum squared error of the model with x removed. Table 3 shows the strength of the relations between the two encoders and speaker identification accuracy and recall.
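The coefficient of partial determination can be computed with ordinary least squares, as sketched below. This is an illustrative numpy sketch; the intercept handling is an assumption.

```python
import numpy as np

def cpd(X, y, j):
    """Coefficient of partial determination for column j of design matrix X:
    (SSE_reduced - SSE_full) / SSE_reduced, where SSE_full is the sum of
    squared errors of the linear model with all variables and SSE_reduced
    omits column j. An intercept column is added internally."""
    def sse(A):
        A = np.column_stack([np.ones(len(A)), A])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ coef
        return r @ r
    full = sse(X)
    reduced = sse(np.delete(X, j, axis=1))
    return (reduced - full) / reduced
```

A variable that drives the outcome yields a CPD near 1; an irrelevant variable yields a CPD near 0 (it is never negative, since the full model's error cannot exceed the reduced model's).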
In summary, using encoder s2i of depth 2 leads to more speaker invariance but not necessarily to better recall; including the speech/text task with a zero-depth s2t encoder tends to improve both speaker invariance and recall. A likely interpretation of this pattern is that the speech/text task improves performance at least in part by encouraging speaker-invariant representations.
| Outcome | Variable | Controlling for | CPD  |
| Speaker | s2i      | s, t, t2i, s2t  | 0.76 |
| R@10    | s2i      | s, t, t2i, s2t  | 0.00 |
| Speaker | s2t      | s, t, t2i, s2i  | 0.45 |
| R@10    | s2t      | s, t, t2i, s2i  | 0.34 |
RSA with regard to textual and visual spaces
Table 4 shows the RSA scores between the encoder representations of utterances and their representations in the spoken, written and visual modalities.
| Encoder      | Audio | Text  | Image |
| Model 1, s2i | 0.043 | 0.194 | 0.187 |
| Model 6, s2i | 0.030 | 0.212 | 0.222 |
| Model 6, s2t | 0.099 | 0.243 | 0.105 |
Comparing the RSA scores between the s2i encoder of model 1 (single task) and model 6 (3 task) we see that the correlations with the textual modality and the visual modality are enhanced while the correlation with the input audio modality drops. This can be interpreted as the speech/text task nudging the model to align more closely with the text, which also ends up contributing to the correlation with the image space. Looking at the scores for model 6 but using the output of the s2t encoder, we see the correlation with the text space is even higher while the correlation with the image space is low. These patterns are what we would expect if speech/text does indeed inject a symbolic inductive bias to the model. Finally an interesting fact is that while the RSA score between the textual and visual modalities is low (0.083), nevertheless model 6’s encoder s2i is moderately correlated with both of these (0.212 and 0.222 respectively).
We show that the addition of the speech/text task leads to substantial performance improvements when compared to training the speech/image task in isolation. Via controlled experiments and analyses we show evidence that this is due to the role of inductive bias on the learned encoder representations.
Limitations and future work
Our current model does not include an explicit speech-to-text decoder, which limits the types of analyses we can perform; in particular, it makes it infeasible to carry out an apples-to-apples comparison with a pipeline architecture. Going forward we would like to go beyond matching tasks and evaluate the impact of an explicit speech-to-text decoder as an auxiliary task. We are also planning to investigate how sensitive our approach is to the amount of data for the auxiliary task. This would be especially interesting given that one motivation for a visually-supervised end-to-end approach is the unavailability of large amounts of transcribed speech in certain circumstances.
I would like to thank Afra Alishahi, Lieke Gelderloos and Ákos Kádár for helpful comments and discussion about this work.