UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

11/19/2021
by   Jianfeng Wang, et al.

In this paper, we propose a single UniFied transfOrmer (UFO), capable of processing either unimodal inputs (e.g., an image or text) or multimodal inputs (e.g., the concatenation of an image and a question) for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, combining an image-text contrastive loss, an image-text matching loss, and masked language modeling losses based on bidirectional and seq2seq attention masks. The same transformer serves as the image encoder, the text encoder, or the fusion network, depending on the pre-training task. Empirically, we observe less conflict among the tasks and achieve new state-of-the-art results on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
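The bidirectional and seq2seq attention masks mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common VL convention in which image tokens attend bidirectionally among themselves while, under the seq2seq mask, text tokens see all image tokens but only earlier text tokens (causally); the exact layout UFO uses may differ.

```python
import numpy as np

def seq2seq_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Attention mask for the concatenated [image; text] sequence.

    mask[i, j] = 1 means position i may attend to position j.
    Illustrative sketch of a seq2seq (captioning-style) mask.
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=np.int64)
    # Image tokens: full bidirectional attention within the image block.
    mask[:num_image_tokens, :num_image_tokens] = 1
    # Text tokens: attend to every image token ...
    mask[num_image_tokens:, :num_image_tokens] = 1
    # ... and causally (lower-triangular) within the text block.
    causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=np.int64))
    mask[num_image_tokens:, num_image_tokens:] = causal
    return mask

def bidirectional_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Full attention over the whole multimodal sequence,
    as used for understanding-style tasks (e.g., ITM, bidirectional MLM)."""
    n = num_image_tokens + num_text_tokens
    return np.ones((n, n), dtype=np.int64)
```

With 2 image tokens and 3 text tokens, the seq2seq mask blocks image-to-text attention (`mask[0, 2] == 0`) and later-text attention (`mask[2, 3] == 0`), while the bidirectional mask is all ones; a single transformer can switch between the two per pre-training task.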

Related research

- EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE (08/23/2023). Building scalable vision-language models to learn from diverse, multimod...
- GIT: A Generative Image-to-text Transformer for Vision and Language (05/27/2022). In this paper, we design and train a Generative Image-to-text Transforme...
- Image-and-Language Understanding from Pixels Only (12/15/2022). Multimodal models are becoming increasingly effective, in part due to un...
- Multimodal Neurons in Pretrained Text-Only Transformers (08/03/2023). Language models demonstrate remarkable capacity to generalize representa...
- TVLT: Textless Vision-Language Transformer (09/28/2022). In this work, we present the Textless Vision-Language Transformer (TVLT)...
- Paying Attention to Multiscale Feature Maps in Multimodal Image Matching (03/20/2021). We propose an attention-based approach for multimodal image patch matchi...
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (05/01/2020). We present HERO, a Hierarchical EncodeR for Omni-representation learning...
