Learning Modality-Invariant Representations for Speech and Images

12/11/2017
by   Kenneth Leidal, et al.
0

In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITs and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to the mean and log variance vectors of a diagonal Gaussian from which sample semantic embeddings are drawn. In addition to encouraging semantic similarity between co-occurring inputs,our loss function includes a regularization term borrowed from variational autoencoders (VAEs) which drives the posterior distributions over embeddings to be unit Gaussian. We can use this regularization term to filter out modality information while preserving semantic information. We speculate this technique may be more broadly applicable to other areas of cross-modality/domain information retrieval and transfer learning.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/01/2020

Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos

In this work, we propose an effective approach for training unique embed...
research
11/05/2017

Learning Word Embeddings from Speech

In this paper, we propose a novel deep neural network architecture, Sequ...
research
11/14/2019

HUSE: Hierarchical Universal Semantic Embeddings

There is a recent surge of interest in cross-modal representation learni...
research
06/02/2021

Exploring modality-agnostic representations for music classification

Music information is often conveyed or recorded across multiple data mod...
research
06/17/2015

Learning Contextualized Semantics from Co-occurring Terms via a Siamese Architecture

One of the biggest challenges in Multimedia information retrieval and un...
research
07/25/2022

ConceptBeam: Concept Driven Target Speech Extraction

We propose a novel framework for target speech extraction based on seman...

Please sign up or login with your details

Forgot password? Click here to reset