Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

06/11/2019
by Yale Song, et al.

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embeddings cannot effectively handle polysemous instances with multiple possible meanings; at best, they find an average representation of the different meanings. This hinders their use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets), which compute multiple, diverse representations of an instance by combining global context with locally guided features via multi-head self-attention and residual learning. To learn a visual-semantic embedding, we tie together two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text data; here, we also tackle the more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW ("my reaction when"). We demonstrate our approach on both image-text and video-text retrieval using MS-COCO, TGIF, and our new MRW dataset.
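
The mechanism named in the abstract, K attention heads over local features whose summaries are added residually to a global descriptor, can be made concrete. Below is a minimal PyTorch sketch of a PIE-Net-style module; the class name, layer sizes, and the simple linear attention scorer are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIENetSketch(nn.Module):
    """Sketch of a PIE-Net-style polysemous embedding module (illustrative).

    Produces K diverse embeddings of one instance by adding K
    attention-weighted summaries of local features to a shared global
    feature (residual learning), then L2-normalizing each result.
    """

    def __init__(self, d_local: int, d_embed: int, num_heads: int = 4):
        super().__init__()
        # One attention scorer per head, applied to every local feature.
        self.attn = nn.Linear(d_local, num_heads)
        # Projects each head's attended summary into the embedding space.
        self.proj = nn.Linear(d_local, d_embed)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor):
        # global_feat: (B, d_embed), e.g. a pooled CNN or sentence feature
        # local_feats: (B, N, d_local), e.g. region, frame, or word features
        scores = self.attn(local_feats)                  # (B, N, K)
        weights = F.softmax(scores, dim=1)               # per-head attention over N
        # Attended local summary per head: (B, K, d_local)
        summaries = torch.einsum('bnk,bnd->bkd', weights, local_feats)
        residual = self.proj(summaries)                  # (B, K, d_embed)
        out = global_feat.unsqueeze(1) + residual        # global + local, per head
        return F.normalize(out, dim=-1)                  # (B, K, d_embed), unit norm

# Example: 36 region features of dim 2048, global feature of dim 1024.
pie = PIENetSketch(d_local=2048, d_embed=1024, num_heads=4)
embeddings = pie(torch.randn(8, 1024), torch.randn(8, 36, 2048))
print(embeddings.shape)  # torch.Size([8, 4, 1024])
```

In the paper's setup, one such module per modality yields K embeddings each, and the two modules are trained jointly under a multiple-instance-learning objective, which (roughly) asks only that some pair among the K×K cross-modal embedding combinations match, rather than forcing a single alignment.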

Related Research

11/17/2017
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
Textual-visual cross-modal retrieval has been a hot research topic in bo...

04/12/2018
Cross-Modal Retrieval with Implicit Concept Association
Traditional cross-modal retrieval assumes explicit association of concep...

03/29/2021
Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval
Cross-modal video-text retrieval, a challenging task in the field of vis...

08/08/2021
OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
We introduce the task of open-vocabulary visual instance search (OVIS). ...

11/21/2022
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
In this paper we tackle the cross-modal video retrieval problem and, mor...

02/24/2023
Deep Learning for Video-Text Retrieval: a Review
Video-Text Retrieval (VTR) aims to search for the most relevant video re...

01/11/2020
MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
Visual-semantic embedding enables various tasks such as image-text retri...
