Multi-Modal Few-Shot Temporal Action Detection

11/27/2022
by   Sauradip Nag, et al.

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples and instead exploits a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD: it leverages few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method, enabled by efficiently bridging pretrained vision and language models whilst maximally reusing their already-learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To handle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that MUPPET outperforms state-of-the-art alternatives, often by a large margin. We also show that MUPPET can be easily extended to few-shot object detection, again achieving state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
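To make the prompt-construction idea concrete, below is a minimal PyTorch sketch of mapping pooled support-video features into a vision-language model's textual token space and concatenating them with class-name token embeddings. All module names, dimensions, the bottleneck-adapter design, and the mean-pooled few-shot prototype are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class VisualSemanticsTokenizer(nn.Module):
    """Maps pooled support-video features into the textual token space of a
    vision-language model via a small, meta-learnable adapter (a sketch)."""
    def __init__(self, vid_dim: int = 768, tok_dim: int = 512, n_tokens: int = 4):
        super().__init__()
        self.n_tokens, self.tok_dim = n_tokens, tok_dim
        # Lightweight bottleneck adapter producing n_tokens "pseudo-word" embeddings.
        self.adapter = nn.Sequential(
            nn.Linear(vid_dim, vid_dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(vid_dim // 2, n_tokens * tok_dim),
        )

    def forward(self, support_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: (K, vid_dim) features of the K support videos of one class.
        pooled = support_feats.mean(dim=0)                # few-shot class prototype
        tokens = self.adapter(pooled)                     # (n_tokens * tok_dim,)
        return tokens.view(self.n_tokens, self.tok_dim)   # visual pseudo-word tokens

def build_multimodal_prompt(class_name_emb: torch.Tensor,
                            visual_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenates class-name token embeddings (from the VLM's input embedding
    table) with visual pseudo-words to form one multi-modal prompt sequence."""
    # class_name_emb: (T, tok_dim); visual_tokens: (n_tokens, tok_dim)
    return torch.cat([class_name_emb, visual_tokens], dim=0)

# Toy usage with random stand-ins for real video features / text embeddings.
tokenizer = VisualSemanticsTokenizer()
support = torch.randn(5, 768)      # 5-shot support video features (hypothetical)
name_emb = torch.randn(3, 512)     # e.g. token embeddings of "high jump"
prompt = build_multimodal_prompt(name_emb, tokenizer(support))
print(prompt.shape)                # torch.Size([7, 512])

Projecting support videos into the token space is what lets a frozen text encoder consume visual evidence alongside the class name, so the pretrained language capacity is reused rather than retrained; only the small adapter needs meta-learning.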
