Perceptual Grouping in Vision-Language Models

10/18/2022
by Kanchana Ranasinghe et al.

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information, which may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within it, but importantly, where that content resides. In this work we examine how well vision-language models understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate that contemporary vision-language representation learning models, trained with contrastive losses on large web-scale data, capture limited object localization information. We propose a minimal set of modifications that results in models that learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, and robustness analyses. We find that the resulting model achieves state-of-the-art results in unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
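The zero-shot recognition mechanism the abstract refers to can be sketched as follows: an image embedding is scored against text-prompt embeddings by cosine similarity, and the best-matching prompt becomes the prediction. The sketch below uses toy random vectors in place of real trained encoders, so the function and embeddings are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """CLIP-style zero-shot scoring: normalize embeddings, take cosine
    similarity against each text prompt, and softmax the scaled logits."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per prompt
    logits = temperature * sims
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return labels[int(np.argmax(sims))], probs

# Toy embeddings standing in for a trained encoder (assumption for
# illustration only; a real system would embed pixels and prompts).
rng = np.random.default_rng(0)
cat_text = rng.normal(size=8)
dog_text = rng.normal(size=8)
image = cat_text + 0.1 * rng.normal(size=8)  # image lies near the cat prompt

label, probs = zero_shot_classify(
    image,
    np.stack([cat_text, dog_text]),
    ["a photo of a cat", "a photo of a dog"],
)
```

Because the "image" embedding is constructed near the cat prompt, the cat label wins; the paper's point is that such similarity scores capture semantics well but say little about *where* in the image the matching content sits.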


