Cross-Modal Retrieval with Implicit Concept Association

04/12/2018
by Yale Song, et al.

Traditional cross-modal retrieval assumes explicit association of concepts across modalities, where there is no ambiguity in how the concepts are linked to each other; e.g., when we perform an image search with the query "dogs", we expect to see dog images. In this paper, we consider a different setting for cross-modal retrieval where data from different modalities are implicitly linked via concepts that must be inferred by high-level reasoning; we call this setting implicit concept association. To foster future research in this setting, we present a new dataset containing 47K pairs of animated GIFs and sentences crawled from the web, in which the GIFs depict physical or emotional reactions to the scenarios described in the text (called "reaction GIFs"). We report on a user study showing that, despite the presence of implicit concept association, humans are able to identify video-sentence pairs with matching concepts, suggesting the feasibility of our task. Furthermore, we propose a novel visual-semantic embedding network based on multiple instance learning. Unlike traditional approaches, we compute multiple embeddings from each modality, each representing a different concept, and measure their similarity by considering all possible combinations of visual-semantic embeddings within the framework of multiple instance learning. A minimal sketch of this matching scheme is given below. We evaluate our approach on two video-sentence datasets with explicit and implicit concept association and report results competitive with existing cross-modal retrieval approaches.
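The sketch below illustrates the general idea of multiple-instance matching between modalities: each modality yields K concept embeddings, all K x K pairwise similarities are computed, and a pooling rule selects the match score. The PyTorch function names, the value of K, and the max-pooling choice are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mil_similarity(video_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Score a video-sentence pair from multiple concept embeddings per modality.

    video_embs: (K, D) concept embeddings for the video
    text_embs:  (K, D) concept embeddings for the sentence
    Returns a scalar similarity (hypothetical MIL-style pooling).
    """
    v = F.normalize(video_embs, dim=-1)   # unit-normalize each concept embedding
    t = F.normalize(text_embs, dim=-1)
    sims = v @ t.t()                      # (K, K) cosine similarities over all pairs
    return sims.max()                     # MIL pooling: best-matching pair decides the score

# Toy usage: K = 4 concept embeddings of dimension 256 per modality.
K, D = 4, 256
video_embs = torch.randn(K, D)
text_embs = torch.randn(K, D)
print(mil_similarity(video_embs, text_embs).item())
```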

Related research

06/11/2019: Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
Visual-semantic embedding aims to find a shared latent space where relat...

10/26/2021: Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval
Learning common subspace is prevalent way in cross-modal retrieval to so...

12/05/2016: Deep Multi-Modal Image Correspondence Learning
Inference of correspondences between images from different modalities is...

08/15/2023: EMID: An Emotional Aligned Dataset in Audio-Visual Modality
In this paper, we propose Emotionally paired Music and Image Dataset (EM...

06/01/2017: Grounding Symbols in Multi-Modal Instructions
As robots begin to cohabit with humans in semi-structured environments, ...

10/16/2018: Cross-Modal and Hierarchical Modeling of Video and Text
Visual data and text data are composed of information at multiple granul...

07/07/2019: Informative Visual Storytelling with Cross-modal Rules
Existing methods in the Visual Storytelling field often suffer from the ...
