UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

06/07/2023
by Yanan Sun, et al.

Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated strong performance on many vision-language tasks. However, the image-text pairs used for pre-training are restricted to the intersection of images and texts, which limits their coverage of the distribution of real-world data; moreover, misaligned pairs introduced during pre-processing add noise. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement that an image and its text appear together. In this paper, we demonstrate that using large-scale unsupervised unimodal models for pre-training can enhance the zero-shot performance of image-text pair models. Our studies validate that models pre-trained in this way learn rich representations of both modalities, improving their ability to understand how images and texts relate to each other. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models by 6.5% on PASCAL-5^i and 6.2% on COCO-20^i semantic segmentation under the zero-shot setting, respectively. By learning representations of both modalities, unimodal pre-training offers broader data coverage, fewer misalignment errors, and the capacity to capture more complex features and patterns in real-world data, resulting in better performance, especially on zero-shot vision-language tasks.
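The abstract stops at the high-level recipe, so the sketch below is an interpretation rather than the authors' code: both towers are initialized from unsupervised unimodal checkpoints (for instance, a masked-image-model vision encoder and a masked-language-model text encoder; the abstract does not name the actual encoders), then aligned on image-text pairs with a CLIP-style symmetric contrastive loss. Every module name, dimension, and the toy stand-in towers are illustrative assumptions.

```python
# Minimal PyTorch sketch, assuming CLIP-style contrastive alignment on top of
# unimodally pre-trained towers. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        # Towers pre-trained separately on unpaired images / unpaired text,
        # loaded here with their unsupervised checkpoints.
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Light projection heads map each modality into a shared space.
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable temperature, as in CLIP (initialized to ln(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, texts):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE: match each image to its caption and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Toy stand-ins for the unimodal towers; in practice these would be, e.g.,
# an MAE-style vision transformer and a BERT-style masked language model.
image_tower = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
text_tower = nn.Sequential(nn.Flatten(), nn.Linear(77 * 512, 768))

model = ContrastiveAligner(image_tower, text_tower, image_dim=768, text_dim=768)
loss = model(torch.randn(8, 3, 224, 224), torch.randn(8, 77, 512))
loss.backward()
```

Under this reading, the paired data is spent purely on cross-modal alignment, while each tower's within-modality structure comes from the much larger unpaired corpora, which is consistent with the abstract's claim of broader coverage and fewer misalignment errors.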

