CoBIT: A Contrastive Bi-directional Image-Text Generation Model

03/23/2023
by Haoxuan You et al.

The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are pre-trained independently with a contrastive objective (as in CLIP), an image-to-text generative objective (as in PaLI), or a text-to-image generative objective (as in Parti). However, all three objectives can be trained on the same data, image-text pairs, and they intuitively complement each other: contrastive learning provides global alignment, while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios, for instance 82.7 in zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning.
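The "unicoder" idea described in the abstract can be pictured as a single Transformer stack whose self-attention mask switches between bidirectional (encoding, used for the contrastive objective) and causal (decoding, used for the generative objectives), so the same weights serve both roles. The minimal PyTorch sketch below illustrates that mode switch only; the module name Unicoder, its sizes, and the decode flag are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a "unicoder": one Transformer stack that can run in
# bidirectional (encoding) or causal (decoding) mode with shared weights.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class Unicoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=6):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)

    def forward(self, x, decode=False):
        # decode=False -> bidirectional attention (encoding for contrastive alignment)
        # decode=True  -> causal attention (decoding for generation)
        mask = None
        if decode:
            n = x.size(1)
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        return self.blocks(x, mask=mask)

# Usage: the same parameters produce encoder features for the contrastive loss
# and causally masked features for the generative losses.
tokens = torch.randn(2, 16, 512)          # toy embedded image or text tokens
unicoder = Unicoder()
encoded = unicoder(tokens)                # bidirectional features
decoded = unicoder(tokens, decode=True)   # causal features
```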
