Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

04/10/2023
by Shuhuai Ren, et al.

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty-thousand classes. Once pre-trained, the prompt has strong transferability and can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on classification datasets (+3.1%) and a +6.9 gain on semantic segmentation compared to ZSSeg.
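As a rough illustration of how such a pre-trained prompt is consumed downstream, the sketch below shows the zero-shot scoring step in a CLIP-style pipeline: class embeddings conditioned on the shared prompt are compared against an image embedding by cosine similarity. The tensors, shapes, and temperature value here are hypothetical stand-ins for the encoder outputs; this is a minimal sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: the abstract mentions a vocabulary of over
# twenty-thousand classes; the embedding width is assumed here.
num_classes = 20_000
embed_dim = 512

# Stand-in for class (text) embeddings produced by the text encoder after
# the shared, pre-trained soft prompt is combined with each class name.
class_embeddings = F.normalize(torch.randn(num_classes, embed_dim), dim=-1)

# Stand-in for the image embedding from the frozen image encoder.
image_embedding = F.normalize(torch.randn(1, embed_dim), dim=-1)

# Zero-shot recognition reduces to temperature-scaled cosine similarity
# between the image embedding and every prompt-conditioned class embedding.
logits = 100.0 * image_embedding @ class_embeddings.t()
predicted_class = logits.argmax(dim=-1)
print(predicted_class.item())
```

The same scoring idea applies per-pixel for segmentation or per-region for detection, which is what allows one pre-trained prompt to be reused across recognition tasks.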

