Teaching CLIP to Count to Ten

02/23/2023
by Roni Paiss et al.

Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent, well-documented limitation: they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant, which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench", a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
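For illustration, a minimal sketch of how a counting-contrastive term of this kind could look is shown below. The function name, encoder interface, temperature value, and two-way softmax form are assumptions made for exposition, not the paper's exact formulation; the only grounded idea is that each image is pushed toward its correct caption and away from an automatically generated counterfactual caption with a wrong count.

```python
# Hypothetical sketch: each image has the embedding of its correct caption
# ("Three dogs playing in the yard") and of a counterfactual caption whose
# count was swapped ("Six dogs playing in the yard"). The counterfactual acts
# as a hard negative, and this term is added to the usual CLIP objective.
import torch
import torch.nn.functional as F


def counting_contrastive_loss(image_emb, caption_emb, counterfactual_emb,
                              temperature=0.07):
    """All inputs are (batch, dim) tensors of L2-normalized embeddings."""
    # Similarity of each image to its correct caption and to the wrong-count caption.
    pos_sim = (image_emb * caption_emb).sum(dim=-1) / temperature
    neg_sim = (image_emb * counterfactual_emb).sum(dim=-1) / temperature
    # Two-way classification: index 0 (the correct caption) is the target class.
    logits = torch.stack([pos_sim, neg_sim], dim=-1)            # (batch, 2)
    targets = torch.zeros(image_emb.size(0), dtype=torch.long,
                          device=image_emb.device)
    return F.cross_entropy(logits, targets)


# During finetuning, this term would be combined with the standard CLIP
# contrastive loss, e.g. (weights illustrative):
# total_loss = clip_loss + lambda_count * counting_contrastive_loss(img, cap, cf)
```

As in the abstract's example, a counterfactual caption can be produced simply by replacing the stated count in the original caption (e.g. "three" with "six"), so the negative differs from the positive only in the number word.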

Related research

05/12/2023 · CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
Recent advances in visual-language models have shown remarkable zero-sho...

10/21/2022 · Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination
Large-scale pretrained language models have made significant advances in...

03/23/2023 · CoBIT: A Contrastive Bi-directional Image-Text Generation Model
The field of vision and language has witnessed a proliferation of pre-tr...

06/02/2023 · Open-world Text-specified Object Counting
Our objective is open-world object counting in images, where the target ...

11/21/2022 · Teaching Structured Vision Language Concepts to Vision Language Models
Vision and Language (VL) models have demonstrated remarkable zero-shot p...

08/16/2023 · Painter: Teaching Auto-regressive Language Models to Draw Sketches
Large language models (LLMs) have made tremendous progress in natural la...
