ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

05/31/2023
by Xiao Xu, et al.

Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performance on various downstream VL tasks, notably 79.15% accuracy on VQAv2 Test-Std. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.
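To make the "manager" idea concrete, below is a minimal PyTorch sketch of adaptive aggregation: a per-cross-modal-layer module that projects the outputs of all N uni-modal encoder layers and combines them with token-wise softmax weights conditioned on the current cross-modal state. This is an illustration under stated assumptions, not the authors' implementation; names such as ManagerLayer, level_proj, and weight_head are hypothetical, and the actual design is in the linked repository.

```python
# Hypothetical sketch of a manager module; not the paper's released code.
import torch
import torch.nn as nn


class ManagerLayer(nn.Module):
    """Aggregates N levels of uni-modal features into one fused input for a
    cross-modal layer, with weights conditioned on the cross-modal query."""

    def __init__(self, hidden_dim: int, num_levels: int):
        super().__init__()
        # One linear map per level to project uni-modal features
        # into the cross-modal space.
        self.level_proj = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_levels)]
        )
        # Predicts a softmax distribution over the N levels from the
        # cross-modal state, so each token can weight levels differently.
        self.weight_head = nn.Linear(hidden_dim, num_levels)

    def forward(self, uni_feats: list[torch.Tensor], query: torch.Tensor) -> torch.Tensor:
        # uni_feats: N tensors of shape (batch, seq_len, hidden_dim)
        # query:     (batch, seq_len, hidden_dim) cross-modal state
        stacked = torch.stack(
            [proj(f) for proj, f in zip(self.level_proj, uni_feats)], dim=-2
        )  # (batch, seq_len, N, hidden_dim)
        weights = self.weight_head(query).softmax(dim=-1)  # (batch, seq_len, N)
        # Weighted sum over the N levels -> (batch, seq_len, hidden_dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)


# Usage: fuse six uni-modal levels for one cross-modal layer.
manager = ManagerLayer(hidden_dim=768, num_levels=6)
feats = [torch.randn(2, 16, 768) for _ in range(6)]
fused = manager(feats, query=torch.randn(2, 16, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```

Because the weights are predicted from the cross-modal state rather than fixed, each cross-modal layer (and each token) can draw on a different mixture of low- and high-level uni-modal semantics, which is the flexibility the abstract contrasts with static layer-by-layer bridging.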

