Meta-Personalizing Vision-Language Models to Find Named Instances in Video

06/16/2023
by   Chun-Hsiao Yeh, et al.

Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as “My dog Biscuit” appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
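As a rough illustration of the idea of representing an instance embedding as a combination of shared category features, the toy sketch below builds an instance vector from a small bank of shared vectors and ranks video frames by cosine similarity. The abstract does not specify the combination rule, so the softmax-weighted sum, the function names (`combine`, `rank_frames`), and the hand-made 3-dimensional vectors are all assumptions made for illustration. The sketch also compares the instance vector to frame vectors directly, whereas the described method inserts the learned word embedding into a query that is processed by the VLM.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(category_features, instance_logits):
    """Assumed form: instance embedding = softmax-weighted sum of shared
    category features. Only the logits are specific to the instance."""
    weights = softmax(instance_logits)
    dim = len(category_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, category_features))
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_frames(query, frame_embeddings):
    """Return frame indices sorted from most to least similar to the query."""
    scores = [cosine(query, frame) for frame in frame_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical shared category features (e.g. two "dog"-related directions).
category_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical learned per-instance logits for "Biscuit".
instance_logits = [2.0, 0.0]
instance_embedding = combine(category_features, instance_logits)

# Hypothetical frame embeddings from a video.
frames = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_frames(instance_embedding, frames)
print(ranking)
```

In this toy setup the second frame points in nearly the same direction as the instance vector, so it ranks first. With a real VLM, the frame and query embeddings would come from the model's image and text encoders rather than being written by hand.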


