
Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

by Vivek Rathod, et al.

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, where the category splits are based on similarity rather than random assignment.
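The core idea described in the abstract can be sketched in a few lines: embed each video frame and each class-name prompt with a frozen image-text model, score frames against class names by cosine similarity, and optionally average in scores from a second modality. The sketch below is a minimal NumPy illustration, not the authors' implementation; random vectors stand in for the pretrained embeddings, and `scores_flow` is a hypothetical second-modality score matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for a pretrained image-text co-embedding (CLIP-style).
# In the paper's setting these would come from a frozen image-text model;
# random vectors are used here purely to make the sketch runnable.
D = 512
class_names = ["surfing", "cooking", "playing guitar"]
text_emb = normalize(rng.normal(size=(len(class_names), D)))   # one per class name
frame_emb = normalize(rng.normal(size=(100, D)))               # one per video frame

# Open-vocabulary scoring: cosine similarity between every frame and every
# class-name embedding; no classifier is trained on these classes.
scores_image = frame_emb @ text_emb.T                          # shape (frames, classes)

# Optional ensemble with a second modality (e.g. optical-flow- or
# audio-based scores from a separate model), combined by averaging.
scores_flow = rng.normal(size=scores_image.shape)              # hypothetical stand-in
scores = 0.5 * scores_image + 0.5 * scores_flow

# Per-frame class predictions; a detection head would then group
# high-scoring frames into temporal segments.
per_frame_pred = scores.argmax(axis=1)
```

Because the class set enters only through the text embeddings, swapping in new class names at test time requires no retraining, which is what makes the approach open-vocabulary.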
