UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

07/03/2023
by   Rui Sun, et al.

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for these tasks have been well studied, but solving them in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works exploited this ability by converting vision-language tasks into an image-text matching problem, but they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method. Code will be available at https://github.com/ThreeSR/UniFine
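To make the "vision-language task as image-text matching" idea concrete, here is a minimal sketch of how a VQA instance can be scored with off-the-shelf CLIP (via Hugging Face Transformers). This is only an illustration of the general global-level matching baseline the abstract refers to, not the UniFine pipeline itself; the image path, question, candidate answers, and the hypothesis template are all hypothetical choices made for the example.

```python
# Minimal sketch: zero-shot VQA as CLIP image-text matching (assumptions noted above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical input image
question = "What color is the cat?"          # hypothetical question
candidate_answers = ["black", "white", "orange"]

# Fold each candidate answer into a declarative hypothesis so that the
# question-answer pair becomes plain text CLIP can match against the image.
texts = [f"the cat is {answer}" for answer in candidate_answers]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_candidates); higher means a better match.
scores = outputs.logits_per_image.softmax(dim=-1)
predicted = candidate_answers[scores.argmax().item()]
print(predicted)
```

A fine-grained variant in the spirit of the paper would additionally match extracted keywords or detected object regions against the text, rather than relying solely on the whole image and whole sentence as above.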
