SceneTrilogy: On Scene Sketches and its Relationship with Text and Photo

04/25/2022
by Pinaki Nath Chowdhury, et al.

For the first time, we extend multi-modal scene understanding to include free-hand scene sketches. This results in a trilogy of scene data modalities (sketch, text, and photo), where each modality offers a distinct perspective on scene understanding, and together they enable a series of novel scene-specific applications across discriminative (retrieval) and generative (captioning) tasks. Our key objective is to learn a common three-way embedding space that enables many-to-many modality interactions (e.g., sketch+text → photo retrieval). To achieve this goal, we leverage information bottleneck theory: we (i) decouple intra-modality information by minimising the mutual information between modality-specific and modality-agnostic components via a conditional invertible neural network, and (ii) align cross-modality information by maximising the mutual information between the modality-agnostic components using InfoNCE, with a dedicated multi-head attention mechanism to allow many-to-many modality interactions. We spell out a few insights on the complementarity of each modality for scene understanding, and study for the first time a series of scene-specific applications such as joint sketch- and text-based image retrieval and sketch captioning.
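
As a rough illustration of the alignment step in (ii), the following is a minimal sketch of an InfoNCE loss over paired embeddings. It is not the authors' implementation: the function names `l2_normalize` and `info_nce`, the use of cosine similarity, and the temperature value are all assumptions made for this example. Row i of one modality is treated as the positive for row i of the other, and every other row in the batch acts as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss for paired embeddings.

    anchors[i] and positives[i] are assumed to describe the same scene
    (e.g. a sketch embedding and its photo embedding). The temperature
    of 0.07 is an assumed default, not a value taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true pair.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy 2-D embeddings, invented purely for illustration.
sketch = [[1.0, 0.0], [0.0, 1.0]]
photo_aligned = [[0.9, 0.1], [0.1, 0.9]]
photo_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(sketch, photo_aligned)
loss_shuffled = info_nce(sketch, photo_shuffled)
```

In this toy setup the loss is lower when matching rows point in similar directions than when the pairs are shuffled. Minimising it is the usual way of maximising a lower bound on the mutual information between the paired representations.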

research
10/19/2022

Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Representation learning for sketch-based image retrieval has mostly been...
research
03/04/2022

FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context

We advance sketch research to scenes with the first dataset of freehand ...
research
05/05/2021

Mixing Modalities of 3D Sketching and Speech for Interactive Model Retrieval in Virtual Reality

Sketch and speech are intuitive interaction methods that convey compleme...
research
03/29/2021

StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval

Sketch-based image retrieval (SBIR) is a cross-modal matching problem wh...
research
11/21/2022

Unifying Vision-Language Representation Space with Single-tower Transformer

Contrastive learning is a form of distance learning that aims to learn i...
research
08/07/2018

SketchyScene: Richly-Annotated Scene Sketches

We contribute the first large-scale dataset of scene sketches, SketchySc...
research
05/19/2023

MaGIC: Multi-modality Guided Image Completion

The vanilla image completion approaches are sensitive to the large missi...