Deep Multimodal Semantic Embeddings for Speech and Images

11/11/2015
by David Harwath, et al.

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
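
To make the two-branch design concrete, the sketch below pairs a small image CNN with a small log-mel spectrogram CNN, projects both into a shared embedding space, and trains with a max-margin ranking loss over in-batch negatives. This is a minimal illustration only: the layer sizes, the spectrogram front end, the margin value, and the exact loss are assumptions for the sketch, not the paper's reported architecture or training recipe.

# Illustrative sketch (PyTorch): two CNN branches tied by a shared embedding.
# All hyperparameters and the loss formulation are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBranch(nn.Module):
    """Small CNN mapping an RGB image to an L2-normalized embedding."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, images):                  # images: (B, 3, H, W)
        feats = self.conv(images).flatten(1)    # (B, 128)
        return F.normalize(self.proj(feats), dim=-1)

class SpeechBranch(nn.Module):
    """Small CNN mapping a log-mel spectrogram to the same embedding space."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, (n_mels, 5), padding=(0, 2)), nn.ReLU(),
            nn.Conv2d(64, 128, (1, 5), padding=(0, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, spectrograms):            # spectrograms: (B, 1, n_mels, T)
        feats = self.conv(spectrograms).flatten(1)
        return F.normalize(self.proj(feats), dim=-1)

def ranking_loss(img_emb, spc_emb, margin=0.2):
    """Max-margin ranking over in-batch negatives (an assumption here):
    matched image/caption pairs should score higher than mismatched ones."""
    scores = img_emb @ spc_emb.t()              # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)            # matched-pair scores
    cost_img = (margin + scores - pos).clamp(min=0)      # rank captions per image
    cost_spc = (margin + scores - pos.t()).clamp(min=0)  # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_img.masked_fill(mask, 0).mean() + cost_spc.masked_fill(mask, 0).mean()

# Toy usage: a random batch of 4 image/spoken-caption pairs.
images = torch.randn(4, 3, 224, 224)
specs = torch.randn(4, 1, 40, 200)
loss = ranking_loss(ImageBranch()(images), SpeechBranch()(specs))
print(loss.item())

At retrieval time, image search and annotation both reduce to ranking items in the opposite modality by dot-product similarity in the shared space, which is why both branches normalize their outputs in this sketch.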

Related research

SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set (07/26/2017)
This paper presents an augmentation of MSCOCO dataset where speech is ad...

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units (12/31/2020)
In this paper we present the first model for directly synthesizing fluen...

Symbolic inductive bias for visually grounded learning of spoken language (12/21/2018)
A widespread approach to processing spoken language is to first automati...

Learning Word-Like Units from Joint Audio-Visual Analysis (01/25/2017)
Given a collection of images and spoken audio captions, we present a met...

ConceptBeam: Concept Driven Target Speech Extraction (07/25/2022)
We propose a novel framework for target speech extraction based on seman...

Predictive Authoring for Brazilian Portuguese Augmentative and Alternative Communication (08/18/2023)
Individuals with complex communication needs (CCN) often rely on augment...

Direct multimodal few-shot learning of speech and images (12/10/2020)
We propose direct multimodal few-shot models that learn a shared embeddi...
