Parts of Speech-Grounded Subspaces in Vision-Language Models

05/23/2023
by James Oldfield, et al.

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability with respect to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form, while respecting the underlying geometry of the manifold on which the representations lie. Moreover, we show that the proposed model facilitates learning subspaces corresponding to specific visual appearances (e.g. artists' painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class-invariance metrics and improvements over baseline zero-shot classification. Our code is available at: https://github.com/james-oldfield/PoS-subspaces.
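To make the subspace-learning idea concrete, below is a minimal sketch of one way such a part-of-speech subspace could be computed from two sets of CLIP embeddings. This is not the paper's exact formulation: the generalized-eigenproblem objective, the ridge term eps, the subspace dimension k, and the helper names pos_subspace and project are all illustrative assumptions. The intent is only to show the general pattern of maximising variance for one part of speech while suppressing it for the rest, with a closed-form projection at the end.

import numpy as np
from scipy.linalg import eigh

def pos_subspace(target_emb, other_emb, k=16, eps=1e-3):
    # target_emb: (n_t, d) CLIP embeddings whose variability we want to keep
    #             (e.g. captions that vary only in their adjectives).
    # other_emb:  (n_o, d) embeddings for the remaining parts of speech.
    S_t = np.cov(target_emb, rowvar=False)   # covariance of the target PoS
    S_o = np.cov(other_emb, rowvar=False)    # covariance of the rest
    d = S_t.shape[0]
    # Generalized eigenproblem S_t v = w (S_o + eps*I) v: the top
    # eigenvectors maximise target-PoS variance per unit of "other" variance.
    w, V = eigh(S_t, S_o + eps * np.eye(d))
    B = V[:, np.argsort(w)[::-1][:k]]        # keep the top-k directions
    Q, _ = np.linalg.qr(B)                   # orthonormal basis, shape (d, k)
    return Q

def project(z, Q):
    # Project an embedding onto the subspace, then renormalise so the
    # result stays on the unit hypersphere where CLIP embeddings live.
    p = Q @ (Q.T @ z)
    return p / np.linalg.norm(p)

Under this sketch, projecting onto Q isolates the targeted visual property, while projecting onto the orthogonal complement (replacing Q @ Q.T with I - Q @ Q.T) would remove it, loosely mirroring how the paper's learned subspaces can strip an entire visual theme, such as an artist's style, from CLIP-based text-to-image synthesis.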


