Do Vision-Language Pretrained Models Learn Primitive Concepts?

03/31/2022
by Tian Yun, et al.

Vision-language (VL) pretrained models have achieved impressive performance on multimodal reasoning and zero-shot recognition tasks. Many of these VL models are pretrained on unlabeled image-caption pairs collected from the internet. In this paper, we study whether the notion of primitive concepts, such as color and shape attributes, emerges automatically in these pretrained VL models. We propose to learn compositional derivations that map primitive concept activations into composite concepts, a task we show to be straightforward given ground-truth primitive concept annotations. This compositional derivation learning (CompDL) framework allows us to quantitatively measure the usefulness and interpretability of the learned derivations by jointly considering the entire set of candidate primitive concepts. Our study reveals that state-of-the-art VL pretrained models learn primitive concepts that are highly useful as visual descriptors, as demonstrated by strong performance on fine-grained visual recognition tasks, but that these concepts struggle to yield interpretable compositional derivations, which highlights a limitation of existing VL models. Code and models will be released.
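To make the setup concrete, below is a minimal sketch of the idea described in the abstract: primitive concept activations are obtained by scoring an image against attribute prompts with CLIP, and a single linear layer is then fit to derive composite (fine-grained) classes from those activations. The prompt template, the choice of CLIP backbone, the linear form of the derivation, and all variable names are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only (assumed details, not the paper's code): score an image against
# primitive-concept prompts with CLIP, then learn a linear derivation from
# primitive activations to composite-class logits.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate primitive concepts (e.g., color and shape attributes) -- assumed list.
primitives = ["red", "yellow", "round", "elongated", "smooth", "spotted"]
prompts = clip.tokenize([f"a photo of something {p}" for p in primitives]).to(device)

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def primitive_activations(image_path: str) -> torch.Tensor:
    """Return one image-to-primitive similarity score per candidate primitive."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ text_feats.T).squeeze(0).float()  # shape: (num_primitives,)

# Compositional derivation as a single linear layer: composite-class logits are a
# weighted combination of primitive activations. Usefulness can be read off as
# downstream accuracy; interpretability can be probed by inspecting the weights.
num_composites = 200  # e.g., number of fine-grained classes (assumed)
derivation = torch.nn.Linear(len(primitives), num_composites).to(device)
optimizer = torch.optim.Adam(derivation.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(acts: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of primitive activations and composite labels."""
    optimizer.zero_grad()
    loss = loss_fn(derivation(acts), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```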


