Cross-Modal Scene Networks

10/27/2016
by   Yusuf Aytar, et al.

People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality. Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval. Moreover, our visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
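The paper's core idea — modality-specific lower layers feeding a shared representation whose units are pushed to behave consistently across modalities — can be illustrated with a minimal numpy sketch. Everything here is hypothetical: the random linear encoders, the dimensions, and the simple mean-matching penalty are stand-ins for the paper's actual architecture and statistical regularization, chosen only to show the shape of the approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality-specific encoders (random linear maps for
# illustration): each modality keeps its own early layers, then
# projects into a common shared space.
W_image = rng.standard_normal((4096, 512)) * 0.01   # natural-image branch
W_sketch = rng.standard_normal((4096, 512)) * 0.01  # line-drawing branch

def encode(features, W):
    """Project modality-specific features into the shared representation."""
    return np.maximum(features @ W, 0.0)  # ReLU into the shared space

def modality_alignment_penalty(z_a, z_b):
    """Penalize mismatched activation statistics across modalities.

    A simplified stand-in for the paper's regularization: push the mean
    activation of each shared unit to agree across modalities, so that
    units respond to concepts rather than to the input modality.
    """
    return float(np.mean((z_a.mean(axis=0) - z_b.mean(axis=0)) ** 2))

# Toy batches of pooled CNN features, one batch per modality.
x_image = rng.standard_normal((8, 4096))
x_sketch = rng.standard_normal((8, 4096))

z_image = encode(x_image, W_image)
z_sketch = encode(x_sketch, W_sketch)

penalty = modality_alignment_penalty(z_image, z_sketch)
print(z_image.shape, penalty >= 0.0)
```

In training, a penalty like this would be added to the per-modality classification losses, so the network both categorizes scenes and keeps the shared units modality-agnostic; the retrieval experiments in the paper compare representations with and without such alignment.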


