See, Hear, and Read: Deep Aligned Representations

06/03/2017
by Yusuf Aytar, et al.

We capitalize on large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval and transferring classifiers between modalities. Moreover, although our network is trained only with image+text and image+sound pairs, it can also transfer between text and sound, a pairing the network never observed during training. Visualizations of our representation reveal many hidden units that automatically emerge to detect concepts independent of the modality.
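
To make the idea of cross-modal retrieval in a shared space concrete, the sketch below ranks text candidates against a sound query by cosine similarity. It is only an illustration under stated assumptions: the abstract does not say which similarity measure or embedding size the network uses, and the function names (`l2_normalize`, `cosine_similarity`, `retrieve`) and the three-dimensional toy vectors are hypothetical stand-ins for embeddings that a trained aligned network might produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, gallery, top_k=1):
    """Return the names of the top_k gallery items most similar to the query.

    `gallery` maps an item name to its embedding. The query and the gallery
    may come from different modalities, as long as both live in the same
    (assumed) aligned embedding space.
    """
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy embeddings, invented for illustration only (not taken from the paper).
sound_query = [0.9, 0.1, 0.0]  # hypothetical embedding of a sound clip
text_gallery = {
    "a dog barking": [1.0, 0.0, 0.1],
    "waves on a beach": [0.0, 1.0, 0.2],
    "a crowded street": [0.1, 0.2, 1.0],
}

best_match = retrieve(sound_query, text_gallery, top_k=1)
print(best_match)
```

Because retrieval here depends only on distances in the shared space, the same function would serve any pair of modalities, which is one way to picture why a text-to-sound transfer can work even when those two modalities were never paired during training.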

research
07/25/2016

Learning Aligned Cross-Modal Representations from Weakly Aligned Data

People can recognize scenes across many different modalities beyond natu...
research
10/27/2016

SoundNet: Learning Sound Representations from Unlabeled Video

We learn rich natural sound representations by capitalizing on large amo...
research
08/23/2018

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Cross-modal retrieval between visual data and natural language descripti...
research
06/27/2023

Semi-supervised Multimodal Representation Learning through a Global Workspace

Recent deep learning models can efficiently combine inputs from differen...
research
10/24/2022

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at lea...
research
02/28/2023

Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation

A key feature of neural models is that they can produce semantic vector ...
research
08/18/2022

Representation Learning for the Automatic Indexing of Sound Effects Libraries

Labeling and maintaining a commercial sound effects library is a time-co...
