CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

05/12/2023
by Ruixiang Jiang, et al.

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that transfers to downstream tasks such as object detection and segmentation. However, adapting these models for object counting, i.e., estimating the number of objects in an image, remains a formidable challenge. In this study, we conduct the first exploration of transferring visual-language models to class-agnostic object counting. Specifically, we propose CLIP-Count, a novel pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner, without any finetuning on specific object classes. To align the text embedding with dense image features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level image representations for dense prediction. We further design a hierarchical patch-text interaction module that propagates semantic information across image features at different resolutions. By fully exploiting the rich image-text alignment knowledge of pretrained visual-language models, our method generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate that the proposed method achieves state-of-the-art accuracy and generalizability for zero-shot object counting. Project page: https://github.com/songrise/CLIP-Count
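The abstract names two technical components: a patch-text contrastive loss that aligns the prompt embedding with patch-level image features, and a hierarchical patch-text interaction module. As an illustration of the first component only, the snippet below is a minimal PyTorch sketch of one way such a loss could look: patches whose pooled ground-truth density exceeds a threshold are treated as positives for the text prompt and pulled toward it, while background patches are pushed away via a multi-positive InfoNCE objective. The masking rule, temperature, and exact loss form here are assumptions made for clarity, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_feats, text_feat, patch_density,
                                pos_thresh=0.0, tau=0.07):
    """Hedged sketch of a patch-text contrastive loss (not the paper's exact loss).

    patch_feats:   (B, N, D) patch embeddings from the CLIP image encoder
    text_feat:     (B, D)    embedding of the class prompt from the CLIP text encoder
    patch_density: (B, N)    ground-truth density pooled to the patch grid; patches
                             with density above `pos_thresh` count as positives
    tau:           temperature (an assumed value, not taken from the paper)
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)

    # Cosine similarity between each patch and the text prompt, scaled by temperature: (B, N)
    logits = torch.einsum("bnd,bd->bn", patch_feats, text_feat) / tau

    pos_mask = (patch_density > pos_thresh).float()

    # Multi-positive InfoNCE: the probability mass on object patches should
    # dominate the mass on background patches.
    exp_logits = logits.exp()
    pos_sum = (exp_logits * pos_mask).sum(dim=1)
    all_sum = exp_logits.sum(dim=1)
    loss = -torch.log(pos_sum / all_sum + 1e-8)

    # Skip images that contain no positive patches for this prompt.
    valid = pos_mask.sum(dim=1) > 0
    return loss[valid].mean() if valid.any() else logits.sum() * 0.0
```

A usage note: `patch_density` would typically come from average-pooling the ground-truth density map down to the ViT patch grid; that pooling step is assumed here rather than shown.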
