MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

03/15/2023
by   Wei Lin, et al.

Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between the visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation and editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and they require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion, and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at <https://github.com/wlin-at/MAXI>.
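The core idea of training against a noisy "text bag" per video can be sketched as a multiple-instance contrastive objective, e.g. in the style of MIL-NCE: the similarities to all texts in a video's own bag are pooled as positives, while texts from other videos serve as negatives. This is a minimal NumPy illustration under that assumption, not the paper's exact loss; the function name, shapes, and temperature value are hypothetical.

```python
import numpy as np

def mil_nce_loss(video_emb, text_bags, temperature=0.07):
    """Multiple-instance contrastive loss over text bags (illustrative sketch).

    video_emb : (B, D) L2-normalized video embeddings.
    text_bags : (B, K, D) L2-normalized text embeddings; row i holds the K
                candidate texts (matched labels, LLM expansions, captions)
                collected for video i.
    The K texts in a video's own bag are pooled as positives (their
    exponentiated similarities are summed); texts from other videos in the
    batch act as negatives.
    """
    B, K, D = text_bags.shape
    texts = text_bags.reshape(B * K, D)                      # (B*K, D)
    sim = video_emb @ texts.T / temperature                  # (B, B*K)
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))   # stable softmax
    denom = exp_sim.sum(axis=1)                              # over all texts
    pos_mask = np.zeros((B, B * K), dtype=bool)
    for i in range(B):
        pos_mask[i, i * K:(i + 1) * K] = True                # own bag only
    numer = (exp_sim * pos_mask).sum(axis=1)
    return float(-np.log(numer / denom).mean())
```

Because the positives are a subset of the full similarity set, the loss is always non-negative and decreases as the bag's texts align with their video; noisy bag members are tolerated since only the pooled bag similarity must dominate, not each individual text.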


