"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

04/04/2022
by Niv Cohen, et al.

Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. This problem arises in multiple domains, from personalized image retrieval to personalized interaction with smart devices. We introduce a new learning setup called Personalized Vision & Language (PerVL), with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts "in the wild". In PerVL, personalized concepts should be learned (1) independently of the downstream task, (2) in a way that allows a pretrained model to reason about them with free language, and (3) without requiring personalized negative examples. We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts. The model can then reason about them by simply using them in a sentence. We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries.
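To make the vocabulary-extension idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' architecture: the "text encoder" is just the mean of token embeddings, the image features are synthetic vectors, and every name in it (`FROZEN_VOCAB`, `encode_text`, `learn_concept`, the `fluffy` token, the gallery keys) is a hypothetical stand-in. It only illustrates the general pattern described above: keep the pretrained embeddings frozen, add one new word embedding, fit it from a few positive examples, and then use the new word in a query.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding size (assumption; real models are far larger)

def rand_vec(scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(DIM)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical frozen input vocabulary of a "pretrained" model.
FROZEN_VOCAB = {w: rand_vec() for w in ["a", "photo", "of", "unicorn", "mug"]}

def encode_text(tokens, table):
    # Toy stand-in for a frozen text encoder: mean of the token embeddings.
    return mean([table[t] for t in tokens])

def learn_concept(token, example_feats, table, steps=200, lr=1.0):
    """Return a copy of the vocabulary with one new, learned word embedding."""
    extended = {w: list(v) for w, v in table.items()}  # pretrained entries are copied, never updated
    extended[token] = [0.0] * DIM                      # the only trainable vector
    template = ["a", "photo", "of", token]
    target = mean(example_feats)                       # only positive examples are used
    for _ in range(steps):
        enc = encode_text(template, extended)
        # Gradient of ||enc - target||^2 with respect to the new embedding only.
        grad = [2.0 * (e - t) / len(template) for e, t in zip(enc, target)]
        extended[token] = [w - lr * g for w, g in zip(extended[token], grad)]
    return extended

# A few synthetic "image features" of the personalized concept.
proto = normalize(rand_vec())
examples = [add(proto, rand_vec(0.05)) for _ in range(3)]
vocab = learn_concept("fluffy", examples, FROZEN_VOCAB)

# Retrieval: rank a small gallery by similarity to a sentence that uses the new word.
gallery = {
    "fluffy_new_view": add(proto, rand_vec(0.05)),
    "distractor_1": normalize(rand_vec()),
    "distractor_2": normalize(rand_vec()),
}
query = encode_text(["a", "photo", "of", "fluffy"], vocab)
ranking = sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)
print(ranking)
```

Because only the new vector is updated, the pretrained entries in `vocab` stay identical to those in `FROZEN_VOCAB`, which is the property that lets the same learned word be reused across downstream tasks in a setup like this one.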


