Unifying Vision-Language Representation Space with Single-tower Transformer

11/21/2022
by Jiho Jang, et al.

Contrastive learning is a form of distance metric learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can be regarded simply as two different views of the same underlying mutual information, and we train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify the difficulties of learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework toward this goal. We discover intriguing properties that distinguish OneR from prior work that learns modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and we present analyses that provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified, modality-agnostic VLP framework.
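OneR's actual architecture and training recipe are given in the full paper; the snippet below is only a minimal PyTorch sketch of the one-tower idea the abstract describes: a single shared transformer encodes both image patches and text tokens into one representation space, trained here with a symmetric InfoNCE contrastive loss. All names, dimensions, and the specific loss are illustrative assumptions, not the paper's implementation, and positional/modality embeddings are omitted for brevity.

```python
# Minimal sketch of a single-tower (modality-agnostic) encoder with an
# InfoNCE-style contrastive loss. Hypothetical names and dimensions;
# not the paper's actual OneR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneTowerEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4, vocab=30522, patch=16):
        super().__init__()
        # Lightweight modality-specific tokenizers project raw inputs
        # into a common token space.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # One shared transformer: the same weights encode both modalities.
        # (Positional and modality-type embeddings omitted for brevity.)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.tower = nn.TransformerEncoder(layer, depth)

    def encode(self, tokens):
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = self.tower(torch.cat([cls, tokens], dim=1))
        return F.normalize(x[:, 0], dim=-1)  # unit-norm [CLS] embedding

    def forward(self, images, text_ids):
        # Images -> patch tokens; text -> word tokens; same tower for both.
        img_tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        txt_tokens = self.text_embed(text_ids)
        return self.encode(img_tokens), self.encode(txt_tokens)

def info_nce(z_img, z_txt, temperature=0.07):
    # Symmetric contrastive loss: matched image-caption pairs are positives,
    # all other pairs in the batch are negatives.
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(z_img.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with dummy data: an 8-pair batch of images and 32-token captions.
model = OneTowerEncoder()
z_i, z_t = model(torch.randn(8, 3, 224, 224), torch.randint(0, 30522, (8, 32)))
loss = info_nce(z_i, z_t)
```

The key design point the sketch illustrates is weight sharing: unlike CLIP-style two-tower models that keep a separate encoder per modality, every transformer layer here processes both image and text tokens, which is what makes the resulting space modality-agnostic.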


Related research:

12/08/2021 · Everything at Once – Multi-modal Fusion Transformer for Video Retrieval
Multi-modal learning from video data has seen increased attention recent...

03/03/2022 · Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
We present modality gap, an intriguing geometric phenomenon of the repre...

02/28/2022 · Multi-modal Alignment using Representation Codebook
Aligning signals from different modalities is an important step in visio...

02/08/2023 · Diagnosing and Rectifying Vision Models using Language
Recent multi-modal contrastive learning models have demonstrated the abi...

12/23/2016 · DeMIAN: Deep Modality Invariant Adversarial Network
Obtaining common representations from different modalities is important ...

04/25/2022 · SceneTrilogy: On Scene Sketches and its Relationship with Text and Photo
We for the first time extend multi-modal scene understanding to include ...

07/19/2023 · Divert More Attention to Vision-Language Object Tracking
Multimodal vision-language (VL) learning has noticeably pushed the tende...
