Universal Multi-Modality Retrieval with One Unified Embedding Space
This paper presents Vision-Language Universal Search (VL-UnivSearch), which builds a unified model for multi-modality retrieval. VL-UnivSearch encodes queries and multi-modality sources in a universal embedding space to search related candidates and route across modalities. To learn an embedding space tailored for multi-modality retrieval, VL-UnivSearch proposes two techniques: 1) universal embedding optimization, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. VL-UnivSearch achieves state-of-the-art results on the multi-modality open-domain question answering benchmark WebQA and outperforms all retrieval models on each single-modality task. These results demonstrate that universal multi-modality search can replace the divide-and-conquer pipeline with a single unified model while also benefiting per-modality tasks. All source code of this work will be released on GitHub.
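As a rough illustration of the first technique, the sketch below shows a contrastive loss in which each query is paired with its positive candidate and an equal number of hard negatives from the text and image corpora, so that neither modality dominates the gradient. The function name, the per-modality negative count, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of contrastive training with modality-balanced hard negatives.
# Function name, tensor shapes, and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def modality_balanced_contrastive_loss(query_emb, pos_emb,
                                        text_neg_emb, image_neg_emb,
                                        temperature=0.01):
    """query_emb:     (B, D) query embeddings
    pos_emb:       (B, D) embeddings of the relevant (positive) candidates
    text_neg_emb:  (B, K, D) hard negative text embeddings per query
    image_neg_emb: (B, K, D) hard negative image embeddings per query
    """
    query_emb = F.normalize(query_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)

    # Equal numbers of text and image hard negatives keep the modalities balanced.
    negs = F.normalize(torch.cat([text_neg_emb, image_neg_emb], dim=1), dim=-1)  # (B, 2K, D)

    pos_score = (query_emb * pos_emb).sum(-1, keepdim=True)    # (B, 1)
    neg_score = torch.einsum("bd,bkd->bk", query_emb, negs)    # (B, 2K)

    logits = torch.cat([pos_score, neg_score], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                     # positive sits at index 0
```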