Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

12/20/2022
by Martha Lewis, et al.

Large-scale models that combine text and images have made remarkable progress in recent years. However, they can still fail at tasks requiring compositional knowledge, such as correctly picking out a red cube from a picture of multiple shapes. We examine the ability of CLIP (Radford et al., 2021) to caption images requiring compositional knowledge. We implement five compositional language models to probe the kinds of structure that CLIP may be using, and develop a novel training algorithm, Compositional Skipgram for Images (CoSI), to train these models. We look at performance in attribute-based tasks, which require identifying a particular combination of attribute and object (such as "red cube"), and in relational settings, where the spatial relation between two shapes (such as "cube behind sphere") must be identified. We find that in some conditions, CLIP is able to learn attribute-object labellings and to generalize to unseen attribute-object combinations. However, we also see evidence that CLIP is not able to bind features together reliably. Moreover, CLIP is not able to reliably learn relations between objects, whereas some compositional models are able to learn these perfectly. Of the five models we developed, none were able to generalize to unseen relations.
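To make the notion of "binding" concrete, the following is a minimal toy sketch, not the paper's models or CLIP itself. It assumes hypothetical random concept vectors and compares two illustrative encodings of a scene: a purely additive one, which loses track of which attribute belongs to which object, and one that binds each attribute to its object with an elementwise product before summing. All names (`CONCEPTS`, `encode_additive`, `encode_bound`) and the choice of binding operator are assumptions made for illustration only.

```python
import math
import random

random.seed(0)
DIM = 16  # arbitrary toy dimensionality (an assumption)


def rand_vec():
    """Draw a random Gaussian vector to stand in for a concept embedding."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


# Hypothetical concept vectors; a real model would learn these.
CONCEPTS = {name: rand_vec() for name in ("red", "blue", "cube", "sphere")}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def bind(u, v):
    # Elementwise product as one simple binding operator (an assumption).
    return [a * b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encode_additive(pairs):
    """Sum all attribute and object vectors; no binding between them."""
    total = [0.0] * DIM
    for attr, obj in pairs:
        total = add(total, add(CONCEPTS[attr], CONCEPTS[obj]))
    return total


def encode_bound(pairs):
    """Bind each attribute to its object, then sum the bound pairs."""
    total = [0.0] * DIM
    for attr, obj in pairs:
        total = add(total, bind(CONCEPTS[attr], CONCEPTS[obj]))
    return total


scene = [("red", "cube"), ("blue", "sphere")]
swapped = [("blue", "cube"), ("red", "sphere")]

# The additive encodings coincide up to floating-point rounding, so the
# two scenes cannot be told apart; the bound encodings generally differ.
additive_sim = cosine(encode_additive(scene), encode_additive(swapped))
bound_sim = cosine(encode_bound(scene), encode_bound(swapped))
print(f"additive similarity: {additive_sim:.6f}")
print(f"bound similarity:    {bound_sim:.6f}")
```

Because addition is commutative, the additive encoding treats "red cube, blue sphere" and "blue cube, red sphere" as the same scene. That is the kind of failure the attribute-object tasks above are designed to expose. The elementwise product is only one of many possible binding operators and is used here purely for simplicity.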


