Universal Multi-Modality Retrieval with One Unified Embedding Space

09/01/2022
by   Zhenghao Liu, et al.

This paper presents Vision-Language Universal Search (VL-UnivSearch), which builds a unified model for multi-modality retrieval. VL-UnivSearch encodes queries and multi-modality sources in a universal embedding space to search for related candidates and route across modalities. To learn an embedding space tailored to multi-modality retrieval, VL-UnivSearch proposes two techniques: 1) universal embedding optimization, which contrastively optimizes the embedding space using modality-balanced hard negatives; and 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. VL-UnivSearch achieves state-of-the-art performance on the multi-modality open-domain question answering benchmark WebQA, and outperforms all retrieval models on each single-modality task. It demonstrates that universal multi-modality search can feasibly replace the divide-and-conquer pipeline with a unified model, while also benefiting per-modality tasks. All source code of this work will be released via GitHub.
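The "modality-balanced hard negatives" idea can be illustrated with a short sketch: for each query, the hardest (highest-scoring) negatives are drawn in equal numbers from the text and image candidate pools before computing a standard InfoNCE-style contrastive loss. This is a minimal illustration under stated assumptions, not the paper's implementation; the function names, the `k_per_modality` parameter, and the two-modality split are all assumptions introduced here.

```python
import numpy as np

def balanced_hard_negatives(scores, modality, k_per_modality=2):
    """Select the top-k hardest negatives from EACH modality so that
    neither text nor image candidates dominate the negative set.

    scores   : similarity of the query to each candidate (higher = harder)
    modality : array of 'text' / 'image' labels, one per candidate
    """
    picked = []
    for m in ("text", "image"):
        idx = np.where(modality == m)[0]
        # sort this modality's candidates by descending score, keep top-k
        hardest = idx[np.argsort(scores[idx])[::-1][:k_per_modality]]
        picked.extend(hardest.tolist())
    return picked

def contrastive_loss(q, pos, negs, tau=0.05):
    """InfoNCE loss over one positive and the balanced negative set."""
    sims = np.array([q @ pos] + [q @ n for n in negs]) / tau
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                 # positive sits at index 0
```

In a real training loop the query, positive, and negative vectors would come from the shared encoder; the point of the balancing step is that a naive top-k over the whole pool can collapse to one modality, which skews the learned embedding space.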

