Deep Multi-Modal Image Correspondence Learning

12/05/2016
by   Chen Liu, et al.
0

Inference of correspondences between images from different modalities is an extremely important perceptual ability that enables humans to understand and recognize cross-modal concepts. In this paper, we consider an instance of this problem that involves matching photographs of building interiors with their corresponding floorplan. This is a particularly challenging problem because a floorplan, as a stylized architectural drawing, is very different in appearance from a color photograph. Furthermore, individual photographs by themselves depict only a part of a floorplan (e.g., kitchen, bathroom, and living room). We propose the use of a number of different neural network architectures for this task, which are trained and evaluated on a novel large-scale dataset of 5 million floorplan images and 80 million associated photographs. Experimental evaluation reveals that our neural network architectures are able to identify visual cues that result in reliable matches across these two quite different modalities. In fact, the trained networks are able to even outperform human subjects in several challenging image matching problems. Our result implies that neural networks are effective at perceptual tasks that require long periods of reasoning even for humans to solve.

READ FULL TEXT

page 1

page 3

page 4

page 7

page 8

page 9

page 10

research
09/12/2020

RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

We study an important, yet largely unexplored problem of large-scale cro...
research
04/12/2018

Cross-Modal Retrieval with Implicit Concept Association

Traditional cross-modal retrieval assumes explicit association of concep...
research
01/13/2021

Probabilistic Embeddings for Cross-Modal Retrieval

Cross-modal retrieval methods build a common representation space for sa...
research
06/01/2021

Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Cross-modal retrieval is an important functionality in modern search eng...
research
07/12/2018

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

We propose a novel framework, called Disjoint Mapping Network (DIMNet), ...
research
04/14/2016

Filling in the details: Perceiving from low fidelity images

Humans perceive their surroundings in great detail even though most of o...
research
10/27/2021

Perceptual Score: What Data Modalities Does Your Model Perceive?

Machine learning advances in the last decade have relied significantly o...

Please sign up or login with your details

Forgot password? Click here to reset