Prompting Visual-Language Models for Efficient Video Understanding

Visual-language pre-training has shown great success for learning joint visual-textual representations from large-scale web data, demonstrating remarkable ability for zero-shot generalisation. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training, and here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components of the approach. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and open-set scenarios, we achieve performance competitive with or superior to existing methods, despite training significantly fewer parameters.


1 Introduction

While research in computer vision has mainly focused on tackling particular tasks, the grand goal towards human-level perception has always been to learn general-purpose visual representations that can solve various problems with minimal tuning. Towards such a goal, recent work on training visual-language models has shown promising progress. For example, CLIP [55] and ALIGN [27] learn a joint representation for image and text with simple noise contrastive learning, greatly benefiting from the rich information in text descriptions, e.g. actions, objects, human-object interactions, and object-object relationships. As a result, these visual-language models have demonstrated remarkable “zero-shot” generalisation for various image classification tasks. Crucially, the data used to train these powerful visual-language models can simply be crawled from the Internet at scale, without any laborious manual annotation. It is therefore reasonable to believe that, with growing computation, larger datasets will be collected and more powerful models will be trained in the near future.

Given this promise, one question naturally arises: how can we best exploit the abilities of these powerful visual-language models, and effectively adapt them to solve particular novel vision tasks of interest? One possible solution would be to finetune the image encoder end-to-end on the considered downstream tasks; however, as each downstream task would need to finetune and save its own set of parameters, we would end up developing hundreds of models for hundreds of individual tasks. Even more problematic, discarding the text encoder loses the model’s ability for “zero-shot” generalisation, so the resultant model can only work on a fixed set of pre-determined categories. Alternatively, as shown in CLIP [55], given properly designed “prompts”, the model is able to work on a variety of downstream tasks, with the classifiers being dynamically generated by the text encoder from category names or other free-form texts. The prompts here are handcrafted cloze templates that facilitate classifier generation, so that downstream visual tasks can be formulated in the same format as the pre-training objectives, effectively closing the gap between pre-training and downstream tasks. One remaining issue is that such handcrafted prompts require extensive expert knowledge and labor, limiting their use for efficient task adaptation.

In this paper, we continue in the vein of prompt-based learning, with the goal of exploring an efficient way to adapt a visual-language model to novel tasks. We consider a simple idea: prepending / appending a sequence of random vectors, termed “continuous prompt vectors”, to the textual input. These prompt vectors consist entirely of free parameters that do not correspond to any concrete words, and the subsequent layers of the text encoder attend to these vectors as if they were a sequence of “virtual tokens”, in order to generate the classifier or embedding. Although the weights of the text encoder are kept frozen, gradients are back-propagated through it to optimise the trainable prompt vectors. Consequently, a single copy of the visual backbone is able to perform various video understanding tasks, with a minimal number of trainable parameters for each task, i.e. only a few prompt vectors.

To summarise, building on a scalable, powerful image-based visual-language model, we first propose a simple idea for efficient and lightweight model adaptation, through learning task-specific prompt vectors. Here, we consider video understanding as the downstream task, e.g. action recognition, action localisation, and text-video retrieval. In particular, we formulate action recognition and retrieval under the same umbrella, that is, maximising the similarity matching between visual and textual embeddings, with the texts being action labels or fine-grained descriptions respectively. We extensively analyse several critical components of our method, e.g. the number of prompt vectors and the usefulness of temporal modeling. Lastly, we evaluate the adaptation idea on public video benchmarks, across closed-set, few-shot, and open-set scenarios. Despite training only a few free parameters, i.e. several prompt vectors and two Transformer layers, we achieve performance competitive with or superior to existing methods. In few-shot and open-set scenarios, we significantly outperform existing methods, sometimes by over 10%.

2 Related Work

Joint Visual-Textual Learning. In the literature, [52] explored the connection between images and words using paired text documents, and [69, 16] proposed to jointly learn image-text embeddings with class name annotations. Recently, CLIP [55] and ALIGN [27] have further scaled up the training with large-scale web data. Using simple noise contrastive learning, they show that powerful visual representations can be learnt from paired image-caption data. In the video domain, similar ideas have also been explored for representation learning [47] and video retrieval [48, 33].

Prompting refers to designing proper “instructions” that a pre-trained language model can understand in order to generate the desired outputs, sometimes using a few examples as demonstrations. For instance, given properly handcrafted prompt templates, GPT-3 [5] has shown strong generalisation for few-shot and zero-shot learning. However, handcrafted templates require extensive expert knowledge, limiting their flexibility. Later work proposes to automate prompt engineering by searching for discrete prompts [29, 58, 57, 22] or learning continuous prompts [35, 34]. In this work, we search for continuous prompts to steer pre-trained visual-language models towards video understanding tasks.

Video Action Recognition. In the last decade, research on developing effective architectures has gone through rapid development, from two-stream networks [61, 66, 15] to more recent single-stream RGB networks [8, 65, 70, 14, 13, 2]. With the help of abundant training data, e.g. Kinetics [7], recognition accuracy has been steadily improved. In addition, researchers have also explored data-efficient learning, for instance, few-shot and open-set action recognition. In the former line of research, only a few training samples are available for each action category: [82, 83] proposed compound memory networks to classify videos by matching and ranking; [11] used GANs to synthesise training examples for novel categories; [6] proposed differentiable dynamic time warping to align videos of different lengths; [54] exploited a CrossTransformer to find temporally-corresponding frame tuples between the query and the given few-shot videos. Open-set action recognition requires the model to generalise to action categories unseen in the training set; one typical idea is to learn a common representation space shared by seen and unseen actions, such as an attribute space [42, 19] or a semantic space [36, 20], synthesising features for unseen actions [49], or using objects to create a common space for unseen actions [46].

Video Action Localisation considers detecting and classifying actions of interest in untrimmed long videos. Generally speaking, there are two popular paradigms: the two-stage paradigm [60, 71, 80, 9, 41, 39, 64, 30] first detects class-agnostic action proposals, which cover the correct segments with high recall, then classifies and refines each of these proposals. In contrast, the one-stage paradigm [75, 40, 53] combines action detection and classification, and densely classifies each timestamp into action or background.

Concurrent Work. Several recent arXiv papers [81, 21, 78] also consider prompt learning for efficient transfer from pre-trained visual-language models to downstream tasks, e.g. image classification. In the video domain, CLIP4Clip [45] and ActionCLIP [67] propose to finetune pre-trained CLIP end-to-end on individual video tasks, e.g. retrieval and action recognition. In contrast, we favor efficient adaptation from image to video, and aim to learn task-specific continuous prompts to generate classifiers for various video understanding tasks under one framework.

Figure 1: Framework Overview. We adopt a lightweight Transformer module on top of the CLIP image encoder for temporal modeling; prepend / append learnable continuous prompt vectors to the CLIP text encoder for generating action classifiers or text query embeddings. During training, both image and text encoders of CLIP are kept frozen. By optimising the task-specific prompt vectors and temporal Transformer module, we efficiently adapt CLIP to various video understanding tasks: action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and open-set scenarios.

3 Method

Our goal is to efficiently steer an image-based visual-language model to tackle novel downstream tasks, which we term model adaptation. Here, we consider the downstream tasks to be video understanding, i.e. action recognition, localisation, and video retrieval. To be self-contained, in Section 3.1 we briefly review the pre-training and inference procedure of the original CLIP [55]; in Section 3.2, we describe the proposed idea for learning task-specific prompt vectors, and for temporal modeling.

3.1 Visual-Language Model: CLIP

Pre-training. Given $N$ (image, text) pairs in one sampled batch, the feature embeddings for image and text are computed with two encoders respectively, and a dense cosine similarity matrix is calculated between all $N \times N$ possible (image, text) pairs. The training objective is to jointly optimise the image and text encoders, by maximising the similarity between the $N$ correct (image, text) associations, while minimising the similarity for incorrect pairs via a symmetric cross-entropy over the dense matrix, i.e. noise contrastive learning.
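To make this objective concrete, below is a minimal PyTorch sketch of a symmetric noise-contrastive loss of this kind; the function name, temperature value, and tensor shapes are illustrative rather than CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_feats, text_feats: [N, D] embeddings from the two encoders;
    matching pairs share the same row index, all other pairings are negatives.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # N x N cosine-similarity matrix, scaled by the temperature.
    logits = image_feats @ text_feats.t() / temperature

    # Correct (image, text) associations lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```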

Note that, both image and text encoders contain a tokeniser for converting image patches or language words to vectors. In particular, the input images are divided into patches and then flattened into vectors, also called “visual tokens”; while the input texts are converted into vectors (“textual tokens”) by a trainable look-up table.

Inference. Once trained, CLIP can be deployed for image classification with an open vocabulary (zero-shot generalisation), with the visual classifiers being generated from the text encoder $\Phi_{\text{text}}$, which resembles the idea of a hypernetwork [24]. For example, to classify an image as cat or dog, the classifiers ($c_{\text{cat}}$ and $c_{\text{dog}}$) can be generated as:

$c_{\text{cat}} = \Phi_{\text{text}}(\text{tokeniser}(\text{“this is a photo of [cat]”})), \quad c_{\text{dog}} = \Phi_{\text{text}}(\text{tokeniser}(\text{“this is a photo of [dog]”}))$

where “this is a photo of [CLASS]” is a handcrafted prompt template, which has been shown to be effective for image classification.
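As a concrete illustration of this zero-shot inference path, the following sketch uses the publicly released CLIP package; the image path, class names, and exact prompt wording are placeholders.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog"]
# Handcrafted prompt template, turned into per-class classifiers by the text encoder.
texts = clip.tokenize([f"this is a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)    # [1, D]
    classifiers = model.encode_text(texts)    # [num_classes, D]
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    classifiers = classifiers / classifiers.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ classifiers.t()).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```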

Discussion. Despite the tremendous success on zero-shot image classification, CLIP has also been shown to be sensitive to the handcrafted prompt template, clearly posing limitations on its efficient adaptation to novel downstream tasks, where expert knowledge might be difficult to condense or simply unavailable. We therefore consider automating such prompt design, exploring efficient approaches to adapt the pre-trained visual-language model to novel downstream video-related tasks with minimal training.

3.2 Prompting CLIP for Video Understanding

In the following sections, we start by describing the problem scenario and notation (Section 3.2.1); we then introduce the idea of model adaptation through prompt learning (Section 3.2.2); lastly, we augment the CLIP image encoder with temporal modeling, disambiguating events or actions that require temporal reasoning (Section 3.2.3).

3.2.1 Problem Scenario

Given a dataset that consists of training and validation sets, $\mathcal{D} = \{\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{val}}\}$, where each sample is a (video, text) pair $(\mathcal{V}_i, t_i)$. The video $\mathcal{V}_i$ can range from seconds (recognition and retrieval) to minutes long (localisation). Respectively, $t_i$ either refers to one of the action labels in text format for recognition, e.g. “archery”; or dense action category labels of timestamps for localisation; or a fine-grained text description for retrieval.

In the closed-set scenario, the action categories for training and evaluation are the same, i.e. $\mathcal{C}_{\text{train}} = \mathcal{C}_{\text{val}}$; while in the open-set case, the action categories for training and evaluation are disjoint, i.e. $\mathcal{C}_{\text{train}} \cap \mathcal{C}_{\text{val}} = \emptyset$.

3.2.2 Model Adaptation by Learning Prompts

The goal here is to steer a pre-trained CLIP model to perform various video tasks with minimal training. Specifically, we strive for efficient model adaptation by prepending / appending a sequence of continuous random vectors (“prompt vectors”) to the textual tokens. During training, both the image and text encoders of CLIP are kept frozen, and gradients flow through the text encoder to update only the prompt vectors. Ultimately, these learned vectors construct “virtual” prompt templates that can be understood by the text encoder and used to generate the desired classifiers or query embeddings, as detailed below.

Action Recognition considers classifying a video clip or snippet into one of the action categories. In order to generate the classifier, we construct the “virtual” prompt template around the tokenised action category name and feed it into the pre-trained text encoder $\Phi_{\text{text}}$, for instance:

$c_{\text{archery}} = \Phi_{\text{text}}\big([a_1, \ldots, a_k, \ \text{tokeniser}(\text{“archery”}), \ a_{k+1}, \ldots, a_{2k}]\big)$

where $a_i \in \mathbb{R}^{D}$ denotes the $i$-th prompt vector, consisting of learnable parameters, and $D$ is the vector dimension. $c_{\text{archery}}$ refers to the generated classifier for the action “archery”. Note that the prompt vectors are shared across all action categories, and are thus only task-specific.
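A minimal sketch of this construction is shown below, assuming a `text_encoder` that maps a sequence of token embeddings to a single embedding and a frozen `token_embedding` look-up table; this is a simplification of the actual CLIP interface, which would require threading the prompted embeddings through its positional encoding and transformer manually.

```python
import torch
import torch.nn as nn

class PromptedClassifierHead(nn.Module):
    """Generate one classifier per class name from a frozen text encoder,
    with 2*k learnable prompt vectors shared across all classes.

    `text_encoder` is assumed to map token embeddings [C, L', D] to one
    embedding per sequence [C, D]; `token_embedding` is the frozen look-up
    table of the pre-trained model. Both are placeholders for the real
    CLIP components.
    """

    def __init__(self, text_encoder, token_embedding, k=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder
        self.token_embedding = token_embedding
        for module in (self.text_encoder, self.token_embedding):
            for p in module.parameters():
                p.requires_grad = False        # keep CLIP frozen
        # a_1..a_k are prepended, a_{k+1}..a_{2k} appended (task-specific).
        self.prefix = nn.Parameter(0.02 * torch.randn(k, dim))
        self.suffix = nn.Parameter(0.02 * torch.randn(k, dim))

    def forward(self, class_tokens):
        # class_tokens: [C, L] tokenised category names (C = number of classes).
        word_emb = self.token_embedding(class_tokens)         # [C, L, D]
        C = word_emb.size(0)
        prefix = self.prefix.unsqueeze(0).expand(C, -1, -1)   # [C, k, D]
        suffix = self.suffix.unsqueeze(0).expand(C, -1, -1)   # [C, k, D]
        prompted = torch.cat([prefix, word_emb, suffix], dim=1)
        return self.text_encoder(prompted)                    # [C, D] classifiers
```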

Action Localisation considers localising and classifying actions in untrimmed videos. Here, we leverage the two-stage paradigm [9, 79]: first detect potential class-agnostic action proposals (detailed in Section 4.1), then classify each detected proposal.

Text-Video Retrieval considers jointly learning visual and textual embeddings that pair each video with its corresponding textual description. In contrast to action recognition, where a video snippet is coarsely labeled by an action category, the text description in video retrieval contains more fine-grained details, usually a full sentence. We similarly tokenise the entire sentence and feed the result to the text encoder together with the learnable prompt vectors, to generate the query embedding for each sentence.

Summary. In general, learning prompts for model adaptation offers the following benefits: First, since both classification and retrieval can be tackled within one framework, with classifiers or query embeddings generated from text (either category names or free-form descriptions), all tasks can share one backbone, yet achieve competitive performance (Section 4); Second, adapting to a novel task only requires optimising a few prompt vectors, which facilitates the few-shot problem (Section 4.2.2); Third, it makes better use of the abundant training data, and further generalises beyond the closed-set categories (Section 4.2.3).

3.2.3 Temporal Modeling

For pre-training, CLIP relies entirely on (image, text) pairs, which poses clear pros and cons. On the one hand, (image, text) data can easily be crawled from the web, which enables learning much richer content under a given compute constraint; on the other hand, it ignores the temporal component of the visual scene, and struggles to recognise dynamic events, e.g. push or pull, open or close. In this section, we bridge this image-to-video gap by adding a simple and lightweight temporal modeling module.

Specifically, we upgrade the CLIP image encoder $\Phi_{\text{image}}$ into a video encoder $\Phi_{\text{video}}$, by attaching Transformers on top of the frame-wise features from the frozen image encoder:

$\Phi_{\text{video}}(\mathcal{V}) = \phi_{\text{temp}}\big(\Phi_{\text{image}}(f_1), \ldots, \Phi_{\text{image}}(f_T)\big)$

where $\phi_{\text{temp}}$ refers to the temporal modeling module, a multi-layer Transformer encoder consisting of multi-head self-attention, layer norm, and MLPs, and $f_1, \ldots, f_T$ are the sampled frames. To indicate temporal order, we also add learnable temporal positional encodings onto the image features. $\Phi_{\text{video}}(\mathcal{V})$ denotes the dense feature embeddings of the $T$ frames.
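One plausible instantiation of such a temporal module is sketched below; the width, number of heads, and maximum number of frames are illustrative, and for brevity the positional encoding here only covers frame order (the paper additionally encodes the frame sampling gap, see Section 4.1).

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Lightweight temporal module stacked on frozen frame-wise CLIP features.

    Input:  [B, T, D] per-frame embeddings.
    Output: [B, T, D] temporally contextualised embeddings.
    The layer count matches the paper's "2-TFM" setting; width, head count,
    and the maximum number of frames are illustrative.
    """

    def __init__(self, dim=512, num_layers=2, num_heads=8, max_frames=64):
        super().__init__()
        # Learnable temporal positional encoding to indicate frame order.
        self.temporal_pos = nn.Parameter(torch.zeros(max_frames, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        B, T, D = frame_feats.shape
        x = frame_feats + self.temporal_pos[:T].unsqueeze(0)  # broadcast over batch
        return self.encoder(x)  # dense per-frame embeddings
```

Because the image encoder stays frozen, only this small module and the prompt vectors receive gradients, which keeps the per-task training cost low.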

3.2.4 Training Loss

Given a batch of (video, text) pairs, the visual stream ends up with dense frame-wise feature embeddings $\{v_1, \ldots, v_T\}$; while for the textual stream, depending on the considered downstream task, it ends up with a set of action classifiers or textual query embeddings, collectively denoted $\{c_1, \ldots, c_K\}$.

For action recognition and text-video retrieval, we further compute the video-snippet-level feature by mean pooling the dense features:

$\bar{v} = \frac{1}{T}\sum_{t=1}^{T} v_t$   (1)

For action localisation, we mean-pool the dense features within each detected action proposal, to obtain the proposal-level feature (also denoted $\bar{v}$ for simplicity).

During training, we jointly optimise the textual prompt vectors and the temporal modeling module, such that each video (proposal) feature and its paired classifier or textual query embedding emit the highest similarity score among all candidates. This is achieved with a simple NCE loss:

$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\bar{v}_i \cdot c_i / \tau)}{\sum_{j} \exp(\bar{v}_i \cdot c_j / \tau)}$   (2)

Note that both $\bar{v}$ and $c$ have been L2-normalised here, and $\tau$ refers to the temperature hyper-parameter used for scaling. In this way, we effectively close the gap between the CLIP pre-training objective and the various video tasks.
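A compact sketch of Eq. (1)-(2) follows; the tensor shapes, helper name, and default temperature are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def video_text_nce(frame_feats, text_feats, labels, temperature=0.07):
    """Sketch of Eq. (1)-(2): mean-pool dense frame features into a video-level
    embedding, then contrast it against the generated classifiers (or query
    embeddings) with a cross-entropy / NCE loss.

    frame_feats: [B, T, D] outputs of the temporal module.
    text_feats:  [K, D] classifiers or query embeddings from the prompted text encoder.
    labels:      [B] index of the matching text for each video (or proposal).
    """
    video_feats = frame_feats.mean(dim=1)                 # Eq. (1): mean pooling
    video_feats = F.normalize(video_feats, dim=-1)        # L2-normalise v-bar
    text_feats = F.normalize(text_feats, dim=-1)          # L2-normalise c
    logits = video_feats @ text_feats.t() / temperature   # scaled cosine similarities
    return F.cross_entropy(logits, labels)                # Eq. (2)
```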

4 Experiments

In this section, we conduct experiments on three different video understanding tasks, i.e. action recognition, action localisation, and text-video retrieval, across different datasets. In Section 4.2, we conduct ablation studies on action recognition, to validate the usefulness of the proposed components, i.e. prompt learning and temporal modeling. In Section 4.3 and Section 4.4, we further benchmark on two other popular tasks: action localisation and text-video retrieval.

4.1 Implementation Details

In this paper, the image and text encoders are adopted from pre-trained CLIP (ViT-B/16 + Transformer). For model adaptation, both encoders are kept frozen; the only trainable parts are the textual prompt vectors and the temporal modeling module. All video frames are pre-processed to a spatial resolution of 224×224, and the maximum number of textual tokens is 77 (following the original CLIP design).

For action recognition, all videos are decoded to a fixed frame rate, and a fixed number of frames are sampled per video with a random frame gap for training. The temporal positional encodings encode each frame’s index and the frame sampling gap (that is, the video playback speed). The model is optimised using AdamW [44].
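A sketch of this sampling strategy is given below; the candidate gaps and the number of sampled frames are illustrative placeholders, not the paper's exact values.

```python
import random

def sample_frame_indices(num_video_frames, num_samples=16, gaps=(1, 2, 4, 8)):
    """Sample `num_samples` frame indices with a randomly chosen frame gap
    (i.e. playback speed). The gap set and sample count here are illustrative."""
    gap = random.choice(gaps)
    span = num_samples * gap
    start = random.randint(0, max(0, num_video_frames - span))
    indices = [min(start + i * gap, num_video_frames - 1)
               for i in range(num_samples)]
    # The chosen gap can also be fed into the temporal positional encoding.
    return indices, gap
```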

For action localisation, we follow a two-stage paradigm: class-agnostic action proposal detection and proposal classification. To obtain high-quality action proposals, we first divide the entire video into equal-length snippets; then use the CLIP image encoder with one Transformer layer to extract frame-wise embeddings for each snippet; and finally feed these embeddings to the off-the-shelf proposal detectors [37, 74]. These detectors construct a feature pyramid and make predictions in parallel, to determine actionness, centerness, and boundaries. Please refer to [37, 74] for detailed detector architectures and optimisation. Note that our method is flexible with respect to the choice of proposal detector, and we do not innovate on this candidate proposal procedure. To generate the proposal classifiers, we adopt the same implementation details as for action recognition.
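The second-stage proposal classification described above can be sketched as follows; the helper name and tensor shapes are assumptions, and the proposal detector itself is treated as a black box.

```python
import torch
import torch.nn.functional as F

def classify_proposals(frame_feats, proposals, classifiers, temperature=0.07):
    """Second-stage classification: mean-pool frame features inside each
    detected class-agnostic proposal, then score it against the
    prompt-generated action classifiers.

    frame_feats: [T, D] per-frame embeddings of one untrimmed video.
    proposals:   list of (start_idx, end_idx) frame spans from the detector [37, 74].
    classifiers: [K, D] classifiers produced by the prompted text encoder.
    """
    classifiers = F.normalize(classifiers, dim=-1)
    scores = []
    for start, end in proposals:
        prop_feat = frame_feats[start:end + 1].mean(dim=0)  # proposal-level feature
        prop_feat = F.normalize(prop_feat, dim=0)
        scores.append(prop_feat @ classifiers.t() / temperature)
    return torch.stack(scores).softmax(dim=-1)              # [num_proposals, K]
```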

For video retrieval, we take the same number of frames as input, with a random frame gap. Note that we use significantly larger frame gaps here than for action recognition, as the retrieval task tends to require information from long-term visual dependencies. For more details, we refer the reader to Appendix B.

4.2 Action Recognition

Datasets & Metrics. In this section, we conduct experiments on four popular benchmarks. HMDB-51 [32] contains around 7k videos of 51 actions; we follow its standard train/test split. UCF-101 [62] contains around 13k videos spanning 101 human actions; we again follow the standard train/test split. Kinetics-400 [31] (K-400) covers around 230k 10-second clips sourced from YouTube. Each clip depicts one action category, and clips can differ in resolution and frame rate. Kinetics-700 [7] (K-700) is an extension of K-400, with around 650k video clips sourced from YouTube. To evaluate recognition performance, we report the standard TOP1 and TOP5 accuracy, and the average of these two metrics.

Model             Prompt      Temporal    K-400 (TOP1 TOP5 AVG)    K-700 (TOP1 TOP5 AVG)
Baseline-I [55]   hand-craft              52.4
Baseline-II [55]                          66.1
A0                2+X+2       –           65.4  88.7  77.1         56.3  81.9  69.1
A1                4+X+4       –           66.1  89.0  77.6         56.6  82.4  69.5
A2                8+X+8       –           67.9  90.0  79.0         57.4  83.0  70.2
A3                16+X+16     –           68.8  90.1  79.5         57.8  83.1  70.5
A4                16+X+16     1-TFM       75.8  92.9  84.4         64.2  87.3  75.8
A5                16+X+16     2-TFM       76.6  93.3  85.0         64.7  88.5  76.6
A6                16+X+16     3-TFM       76.9  93.5  85.2         64.8  88.4  76.6
A7                16+X+16     4-TFM       76.8  93.5  85.2         64.9  87.9  76.4
Table 1: Ablation study for closed-set action recognition. Baseline-I is the “zero-shot” CLIP inference with handcrafted templates. Baseline-II is the standard linear probe on the pre-trained CLIP image encoder. TFM denotes the number of Transformer layers for temporal modeling.
Method              HMDB-51 (TOP1 TOP5)   UCF-101 (TOP1 TOP5)   K-400 (TOP1 TOP5)   K-700 (TOP1 TOP5)
I3D [8]             74.3  95.1  71.6  90.0  58.7  81.7
S3D-G [70]          75.9  96.8  74.7  93.4
R(2+1)D [65]        74.5  96.8  72.0  90.0
TSM [38]            74.7
R3D-50 [25]         66.0  92.0  54.7
NL-I3D [68]         66.0  76.5  92.6
SlowFast [14]       77.0  92.6
X3D-XXL [13]        80.4  94.6
TimeSformer-L [2]   80.7  94.7
Ours (A5)           66.4  92.1  93.6  99.0  76.6  93.3  64.7  88.5
Table 2: Comparison to state-of-the-art approaches on closed-set action recognition. By training far fewer parameters, our model achieves comparable performance to the existing approaches on all datasets.
Method K-shot N-way Prompt Temporal UCF-101 HMDB-51 K-400
TARN [3] 5  5 78.5
ARN [77] 5  5 83.1 60.6 82.4
OTAM [6] 5  5 85.8
TRX [54] 5  5 96.1 75.6 85.9
Baseline-I [55] –  5 hand-craft 91.9 68.9 95.1
Ours 5  5 98.3 85.3 96.4
Ours 5  5 97.8 84.9 96.0
Baseline-I [55] –  all hand-craft 64.7 40.1 54.2
Ours 5  all 77.6 56.0 57.1
Ours 5  all 79.5 56.6 58.5
Table 3: Comparison to state-of-the-art approaches on few-shot action recognition. Here, “all” refers to the setting where the model is tested on all categories of the corresponding dataset, rather than only 5-way classification, e.g. 101 categories for UCF-101 and 400 categories for K-400. Baseline-I denotes the “zero-shot” CLIP inference with handcrafted templates.

4.2.1 Closed-set Action Recognition

Closed-set video action recognition refers to the commonly adopted setting where the model is trained and evaluated on videos from the same action categories, i.e. $\mathcal{C}_{\text{train}} = \mathcal{C}_{\text{val}}$. For a comprehensive comparison, we train and evaluate on the standard splits of four popular benchmarks, namely HMDB-51, UCF-101, K-400, and K-700.

Ablation Studies are conducted on the two largest benchmarks, namely K-400 and K-700. Table 1 presents the results for prompt learning and the temporal modeling module. Here, the prompt follows the format [k + X + k], i.e. k learnable vectors are prepended and k appended to the tokenised category name X. Note that, although we prepend and append an equal number of prompt vectors, the optimisation can perfectly learn to ignore any of these vectors, thus we do not ablate other prompt formats.

As baselines, we compare with the official results reported in the original CLIP paper [55]. Specifically, Baseline-I refers to “zero-shot” inference with handcrafted prompt templates (“a photo of [CLASS].”), and Baseline-II denotes the standard practice of training a linear classifier on top of the pre-trained image encoder on the downstream dataset.

Generally speaking, training more text prompt vectors brings consistent improvements in both TOP1 and TOP5 accuracy. In addition, adding temporal modeling brings immediate benefits, with average gains of 4.9% and 5.3% on K-400 and K-700, though with diminishing returns as more Transformer layers are added. Overall, the results suggest that both prompt learning and temporal modeling are essential. Compared with Baseline-I, the A3 model demonstrates a performance boost of 18.1%, clearly showing the benefit of learned prompt vectors over handcrafted ones. Moreover, even with fewer trainable parameters, the A3 model also surpasses Baseline-II by 4.4%, showing the superiority of prompting-based adaptation.

For all following action recognition experiments, we inherit the best practice from the ablation study, i.e. prepend / append 16 prompt vectors to category names, and use two Transformer layers for temporal modeling, for the best trade-off between performance and computational cost.

Comparison to SOTA. Table 2 compares our model with existing state-of-the-art approaches on the popular action recognition benchmarks. Overall, on all datasets, our model performs comparably with the competitors, despite training far fewer parameters, i.e. two Transformer layers and several prompt vectors, advocating efficient model adaptation.

4.2.2 Few-shot Action Recognition

Few-shot action recognition aims to classify videos with only a few training samples. In this section, we benchmark on two different settings. The first follows the previous literature [77, 54, 6] and evaluates the standard 5-shot, 5-way classification; in the second, we consider a more challenging setting that classifies all categories given 5-shot training samples. For more details on the dataset splits, please refer to Appendix A.1. As baselines, in both settings, we use the “zero-shot” CLIP inference with handcrafted templates.

Model Prompt Temporal Open-Set (TOP1 TOP5 AVG)
Baseline-I [55] hand-craft 52.4 77.3 64.9
B0 4+X+4 57.4 83.3 70.4
B1 8+X+8 57.7 82.6 70.2
B2 16+X+16 58.4 82.6 70.5
B3 32+X+32 57.5 84.6 71.1
B4 16+X+16 1-TFM 47.9 76.8 62.4
B5 16+X+16 2-TFM 45.5 75.4 60.5
B6 16+X+16 3-TFM 45.6 75.2 60.4
Table 4: Ablation study for open-set action recognition on K-700. Baseline-I refers to the results from CLIP open-set evaluation. The model is trained on 400 categories and tested on the other 300 disjoint categories.

5-Shot-5-Way Setting. For fair comparison, this setting adopts the publicly accessible few-shot splits. Specifically, for HMDB-51 and UCF-101, we follow [77] to collect 10 and 21 testing action categories respectively; for K-400, we follow [82, 54] to collect 24 testing categories. During training, we sample 5 action categories (ways) from the above data, with 5 videos (shots) from each category, and use the remaining data for evaluation. To ensure the statistical significance of the experiments, we conduct multiple trials with random samplings.
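The episode construction can be sketched as below; the helper name and the data layout (a mapping from category to its videos) are assumptions for illustration.

```python
import random

def build_episode(videos_by_class, n_way=5, k_shot=5):
    """Build one few-shot episode: sample `n_way` categories and `k_shot`
    videos per category as the support set; the remaining videos of those
    categories form the query (evaluation) set."""
    classes = random.sample(list(videos_by_class), n_way)
    support, query = [], []
    for c in classes:
        vids = list(videos_by_class[c])
        random.shuffle(vids)
        support += [(v, c) for v in vids[:k_shot]]
        query += [(v, c) for v in vids[k_shot:]]
    return support, query
```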

Table 3 presents the average TOP1 accuracy for the three datasets. Our method (with or without temporal modeling) clearly outperforms all previous methods by a significant margin, around 10% on HMDB-51 and K-400, indicating the superiority of our proposed idea for model adaptation.

THUMOS14 (mAP@IoU) ActivityNet1.3 (mAP@IoU)
Method Date Mode 0.3 0.4 0.5 0.6 0.7 AVG 0.5 0.75 0.95 AVG
CDC [59] 2017 RGB+Flow 40.1 29.4 23.3 13.1 7.9 22.8 45.3 26.0 0.2 23.8
TALNET [9] 2018 RGB+Flow 53.2 48.5 42.8 33.8 20.8 39.8 38.2 18.3 1.3 20.2
BSN [41] 2018 RGB+Flow 53.5 45.0 36.9 28.4 20.0 36.8 46.5 30.0 8.0 30.0
DBS [23] 2019 RGB+Flow 50.6 43.1 34.3 24.4 14.7 33.4
BUTAL [79] 2020 RGB+Flow 53.9 50.7 45.4 38.0 28.5 43.3 43.5 33.9 9.2 30.1
A2NET [74] 2020 RGB+Flow 58.6 54.1 45.5 32.5 17.2 41.6 43.6 28.7 3.7 27.8
GTAD [73] 2020 RGB+Flow 66.4 60.4 51.6 37.6 22.9 47.8 50.4 34.6 9.0 34.1
BSN++ [63] 2021 RGB+Flow 59.9 49.5 41.3 31.9 22.8 41.1 51.3 35.7 8.3 34.9
AFSD [37] 2021 RGB+Flow 67.3 62.4 55.5 43.7 31.1 52.0 52.4 35.3 6.5 34.4
TALNET [9] 2018 RGB 42.6 31.9 14.2
A2NET [74] 2020 RGB 45.0 40.5 31.3 19.9 10.0 29.3 39.6 25.7 2.8 24.8
Baseline-III 2021 RGB 36.3 31.9 25.4 17.8 10.4 24.3 28.2 18.3 3.7 18.2
Ours 2021 RGB 50.8 44.1 35.8 25.7 15.7 34.5 44.0 27.0 5.1 27.3
Table 5: Comparison to state-of-the-art approaches on closed-set action localisation. AVG denotes the average mAP in [0.3:0.1:0.7] on THUMOS14, and [0.5:0.05:0.95] on ActivityNet1.3. Baseline-III uses the same proposal detector as Ours, but uses CLIP with handcrafted templates as proposal classifiers.

5-Shot, All-Category Setting. Here, we consider a more challenging scenario, scaling the problem up to classifying all categories in the dataset with only 5 samples per category, i.e. 400 ways for K-400 and 101 ways for UCF-101. Specifically, on each dataset we sample 5 videos (shots) per action category from the training set to form the few-shot training data, and then measure performance on the corresponding standard testing set.

For this experimental setting, we conduct multiple random sampling rounds and report the average TOP1 accuracy in Table 3. Compared to 5-way classification, the all-category setting is clearly more challenging, yet our model (with or without temporal modeling) still demonstrates promising results. Compared to Baseline-I, our performance gains on UCF-101 and HMDB-51 are around 15%.

4.2.3 Open-set Action Recognition

In this section, we evaluate open-set action recognition, where the videos in the training and validation sets come from disjoint categories, i.e. $\mathcal{C}_{\text{train}} \cap \mathcal{C}_{\text{val}} = \emptyset$. Specifically, we split the K-700 dataset into two parts, with 400 categories for training and the remaining 300 categories for evaluation. For more details on the dataset splits, please refer to Appendix A.2.

As a baseline, we evaluate the CLIP model with handcrafted prompt templates. As reported in Table 4, our model achieves a 6.0% improvement in TOP1 accuracy over Baseline-I, showing the effectiveness of prompt learning for open-set recognition. Interestingly, the number of learnable prompt vectors makes little difference, and adding temporal modeling actually hurts performance. We conjecture this is because the additional Transformer layers specialise on the training set, harming generalisation towards unseen action categories.

4.2.4 Conclusion & Discussion

Across all the action recognition benchmarks, we have demonstrated the effectiveness of prompt learning and temporal modeling. For closed-set action recognition, even without temporal modeling, learnable prompts clearly surpass handcrafted ones and the linear probe setting. Compared to state-of-the-art approaches, despite training far fewer parameters, our model still demonstrates competitive performance on all benchmarks. For few-shot action recognition, with a limited number of training samples, model adaptation through prompt learning really shines, significantly outperforming all previous methods. Lastly, in open-set scenarios, textual prompts make better use of the abundant training data and further improve generalisation beyond the seen actions.

4.3 Action Localisation

Datasets & Metrics. In this section, we conduct experiments on two localisation datasets. THUMOS14 [28] covers untrimmed sports videos of 20 action categories, with an average of around 15 action instances per video; we follow the standard split for training and evaluation. ActivityNet1.3 [26] has around 20k untrimmed videos of 200 action categories, each containing an average of 1.5 action instances; we again follow the standard split. For evaluation metrics, we follow the conventional choice and use mean Average Precision (mAP) at different IoU thresholds. On THUMOS14, we report mAP at the IoU thresholds [0.3:0.1:0.7]; for ActivityNet1.3, the IoU thresholds are [0.5:0.05:0.95].

4.3.1 Closed-set Action Localisation

Closed-set action localisation refers to the common setting where the model is trained and evaluated on videos of the same action categories, i.e. $\mathcal{C}_{\text{train}} = \mathcal{C}_{\text{val}}$. For a fair comparison, we use the same dataset splits as in the literature.

Table 5 reports the comparison. As a baseline, we adopt the same first-stage proposal detector, but use the original CLIP with handcrafted templates (“this is a photo of [CLASS]”) as the second-stage proposal classifier. On both datasets, our model significantly outperforms Baseline-III, again confirming the effectiveness of prompt learning and temporal modeling. Compared with existing methods that use a pre-trained RGB stream, our method also demonstrates superior performance, with around 5.2% and 2.5% gains in average mAP respectively.

4.3.2 Open-set Action Localisation

In this section, we evaluate the open-set scenario, i.e. the action categories for training and testing are disjoint. As we are not aware of any existing benchmark for this challenging problem, we initiate two evaluation settings on THUMOS14 and ActivityNet1.3: one trains on 75% of the action categories and tests on the remaining 25%; the other trains on 50% of the categories and tests on the remaining 50%. To ensure statistical significance, we conduct multiple random samplings to split the categories for each setting.

Table 6 shows the mean performance. In two-stage localisation, as the proposals are class-agnostic, the key is the proposal classifier. As in the closed-set case, we implement a baseline that uses the same proposal detector as our model, but classifies action proposals using the original CLIP with handcrafted prompts. In both settings, our model shows superior performance to Baseline-III. However, compared with the closed-set evaluation, open-set performance drops dramatically. Note that this performance drop comes from two sources: first, a recall drop in the first-stage class-agnostic action proposals, as under the open-set setting the boundary distributions of training and testing proposals differ; second, classification errors in the second stage. In Appendix C.1, we provide a full ablation study of the performance degradation under the open-set scenario.

THUMOS14 ActivityNet1.3
Method Train 0.3 0.4 0.5 0.6 0.7 AVG 0.5 0.75 0.95 AVG
Baseline-III 75% 33.0 25.5 18.3 11.6 5.7 18.8 35.6 20.4 2.1 20.2
Ours 75% 39.7 31.6 23.0 14.9 7.5 23.3 37.6 22.9 3.8 23.1
Baseline-III 50% 27.2 21.3 15.3 9.7 4.8 15.7 28.0 16.4 1.2 16.0
Ours 50% 37.2 29.6 21.6 14.0 7.2 21.9 32.0 19.3 2.9 19.6
Table 6: Results of open-set action localisation. Baseline-III uses the same proposal detector as Ours, but uses CLIP with handcrafted templates as the proposal classifier. Our model is trained on 75% (or 50%) action categories and tested on the remaining 25% (or 50%) action categories.

4.4 Text-Video Retrieval

Datasets & Metrics. In this section, we evaluate on three large-scale text-video retrieval datasets. LSMDC [56] covers movie clips of a few seconds to tens of seconds; we follow the standard protocol for training and validation, and evaluate on 1,000 independent test videos. MSRVTT [72] has 10k videos and 200k captions. We train on the ‘Training-9K’ split [17], and test on ‘test 1k-A’ [76], containing 1,000 clip-text pairs. SMIT [50] (Spoken Moments) contains over 500k videos randomly chosen from the M-MiT training set [51], and 10k validation videos. Each video has at least one text description. For evaluation, following the literature, we report average recall at K (R@K). Due to space limitations, we refer the reader to the Appendices for the full table with median rank (MdR).

Results. Table 7 presents the results on the three benchmarks. Note that, in these experiments, we only employ 8 learnable prompt vectors, i.e. [4+X+4]. This is because the text encoder of pre-trained CLIP takes a limited number of textual tokens (up to 77), whereas text queries for retrieval tend to be long. In cases where the tokenised text query is longer than the maximum length supported by CLIP, we truncate the sequence to fit our specified pattern.

Compared with Baseline-IV, which denotes the original CLIP model with naïvely-encoded text queries, our prompt learning and temporal modeling module demonstrate clear benefits on all benchmarks. Compared with existing approaches that specifically target text-video retrieval, our method still performs competitively, although it only requires optimising a few prompt vectors along with two Transformer layers on the considered retrieval datasets.

MSRVTT (9K) LSMDC SMIT
Method E2E R@1 R@5 R@1 R@5 R@1 R@5
CE [43] 21.7 51.8 12.4 28.5
MMT [18] 24.6 54.0 13.2 29.2
TT-CE+ [10] 29.6 61.6 17.2 36.5
Baseline-IV 31.2 53.7 11.3 22.7 39.3 62.8
Ours ([4+X+4]) 36.7 64.6 13.4 29.5 66.6 87.8
Frozen [1] 31.0 59.5 15.0 30.8
CLIP4Clip [45] 44.5 71.4 22.6 41.0
Table 7: Results of text-video retrieval. Baseline-IV refers to the original CLIP model with text query naïvely encoded, i.e. without using any prompt. E2E denotes if the model has been trained end-to-end.

5 Conclusion

In this paper, building on image-based visual-language models, e.g. CLIP, we propose to learn task-specific prompt vectors for efficient and lightweight model adaptation. We evaluate the proposed idea on popular benchmarks of major video understanding tasks: action recognition, localisation, and text-video retrieval. Thorough comparisons and ablation studies are conducted to analyse the critical components under different scenarios: closed-set, few-shot, and open-set. In the closed-set scenario, despite training only a small number of free parameters, we achieve performance competitive with modern state-of-the-art methods. In the few-shot and open-set scenarios, we significantly outperform existing methods across all tasks, sometimes by over 10%.

Limitation and Broader Impact. Our proposed idea relies on a visual-language model pre-trained on large-scale image alt-text data, which potentially incurs two limitations: first, biases in the web data; second, as temporal modeling is only applied on top of frame-wise visual features, it may fail to capture fine-grained motions. As future work, we expect better visual-language models to be trained, further improving model adaptation through the proposed prompt learning; and, given more compute, end-to-end finetuning with all data combined could further boost performance.

References

  • [1] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. Proceedings of the International Conference on Computer Vision, 2021.
  • [2] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, 2021.
  • [3] Mina Bishay, Georgios Zoumpourlis, and Ioannis Patras. Tarn: Temporal attentive relation network for few-shot and zero-shot action recognition. In Proceedings of the British Machine Vision Conference, 2019.
  • [4] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pages 5561–5569, 2017.
  • [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
  • [6] Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, and Juan Carlos Niebles. Few-shot video classification via temporal alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [7] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
  • [8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [9] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [10] Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, and Yang Liu. Teachtext: Crossmodal generalized distillation for text-video retrieval. In Proceedings of the International Conference on Computer Vision, 2021.
  • [11] Sai Kumar Dwivedi, Vikram Gupta, Rahul Mitra, Shuaib Ahmed, and Arjun Jain. Protogan: Towards few shot learning for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [12] Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, and Aleksandr Petiushko. Mdmmt: Multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021.
  • [13] Christoph Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [14] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. In Proceedings of the International Conference on Computer Vision, 2019.
  • [15] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [16] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013.
  • [17] Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Proceedings of the European Conference on Computer Vision, 2020.
  • [18] Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Proceedings of the European Conference on Computer Vision, 2020.
  • [19] Chuang Gan, Tianbao Yang, and Boqing Gongi. Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [20] Chuang Gan, Yi Yang, Linchao Zhu, Deli Zhao, and Yueting Zhuang. Recognizing an action using its name: A knowledge-based approach. International Journal of Computer Vision, 2016.
  • [21] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  • [22] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Association for Computational Linguistics, 2021.
  • [23] Zhanning Gao, Le Wang, Qilin Zhang, Zhenxing Niu, Nanning Zheng, and Gang Hua. Video imprint segmentation for temporal action detection in untrimmed videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
  • [24] David Ha, Andrew Dai, and Quoc Le. Hypernetworks. In Proceedings of the International Conference on Learning Representations, 2016.
  • [25] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [26] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [27] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, 2021.
  • [28] Yu-Gang Jiang, Jingen Liu, A Roshan Zamir, George Toderici, Ivan Laptev, Mubarak Shah, and Rahul Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
  • [29] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 2020.
  • [30] Chen Ju, Peisen Zhao, Siheng Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Divide and conquer for single-frame temporal action localization. In Proceedings of the International Conference on Computer Vision, 2021.
  • [31] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [32] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision, 2011.
  • [33] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for video-and-language learningvia sparse sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [34] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processinng, 2021.
  • [35] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Association for Computational Linguistics, 2021.
  • [36] Yikang Li, Sheng hung Hu, and Baoxin Li. Recognizing unseen actions in a domain-adapted embedding space. In IEEE International Conference on Image Processing, 2016.
  • [37] Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [38] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the International Conference on Computer Vision, 2019.
  • [39] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the International Conference on Computer Vision, 2019.
  • [40] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the ACM international conference on Multimedia, 2017.
  • [41] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision, 2018.
  • [42] Jingen Liu, Benjamin Kuipers, and Silvio Savarese. Recognizing human actions by attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • [43] Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. Use what you have: Video retrieval using representations from collaborative experts. In Proceedings of the British Machine Vision Conference, 2019.
  • [44] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations, 2019.
  • [45] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021.
  • [46] Pascal Mettes, William Thong, and Cees G. M. Snoek. Object priors for classifying and localizing unseen actions. In International Journal of Computer Vision, 2021.
  • [47] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [48] Antoine Miech, Ivan Laptev, and Josef Sivic. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516, 2018.
  • [49] Ashish Mishra, Anubha Pandey, and Hema A. Murthy. Zero-shot learning for action recognition using synthesized features. Neurocomputing, 2020.
  • [50] Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [51] Mathew Monfort, Bowen Pan, Kandan Ramakrishnan, Alex Andonian, Barry A McNamara, Alex Lascelles, Quanfu Fan, Dan Gutfreund, Rogerio Feris, and Aude Oliva. Multi-moments in time: Learning and interpreting models for multi-action video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [52] Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management (ACM Multimedia Conference), 1999.
  • [53] Megha Nawhal and Greg Mori. Activity graph transformer for temporal action localization. arXiv preprint arXiv:2101.08540, 2021.
  • [54] Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, and Dima Damen. Temporal relational crosstransformers for few-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [55] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 2021.
  • [56] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. International Journal of Computer Vision, 2017.
  • [57] Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, 2021.
  • [58] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the Conference on Empirical Methods in Natural Language Processinng, 2020.
  • [59] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [60] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [61] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
  • [62] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [63] Haisheng Su, Weihao Gan, Wei Wu, Yu Qiao, and Junjie Yan. Bsn++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • [64] Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. Relaxed transformer decoders for direct action proposal generation. In Proceedings of the International Conference on Computer Vision, 2021.
  • [65] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [66] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, 2016.
  • [67] Mengmeng Wang, Jiazheng Xing, and Yong Liu. ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
  • [68] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [69] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, 2011.
  • [70] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning for video understanding. In Proceedings of the European Conference on Computer Vision, 2018.
  • [71] Huijuan Xu, Abir Das, and Kate Saenko. R-C3D: Region convolutional 3d network for temporal activity detection. In Proceedings of the International Conference on Computer Vision, 2017.
  • [72] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [73] Mengmeng Xu, Chen Zhao, David S Rojas, Ali Thabet, and Bernard Ghanem. G-tad: Sub-graph localization for temporal action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [74] Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, and Junwei Han. Revisiting anchor mechanisms for temporal action localization. IEEE Transactions on Image Processing, 2020.
  • [75] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [76] Youngjae Yu, Jongseok Kim, and Gunhee Kim. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision, 2018.
  • [77] Hongguang Zhang, Li Zhang, Xiaojuan Qi, Hongdong Li, Philip H S Torr, and Piotr Koniusz. Few-shot action recognition with permutation-invariant attention. In Proceedings of the European Conference on Computer Vision, 2020.
  • [78] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
  • [79] Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, and Qi Tian. Bottom-up temporal action localization with mutual regularization. In Proceedings of the European Conference on Computer Vision, 2020.
  • [80] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the International Conference on Computer Vision, 2017.
  • [81] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
  • [82] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In Proceedings of the European Conference on Computer Vision, 2018.
  • [83] Linchao Zhu and Yi Yang. Label independent memory for semi-supervised few-shot video classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Appendix A Dataset Splits

Here, we detail the dataset splits for training and testing under different scenarios, namely few-shot action recognition, open-set action recognition, and open-set action localisation.

A.1 Few-shot Action Recognition

In this section, we benchmark on two different settings.

A.1.1 -Shot--Way Setting

In this evaluation scenario, we adopt the publicly available few-shot data splits, i.e. we sample categories ( videos per category) from a set of testing categories to form the few-shot support set. We conduct trials with random samplings to ensure statistical significance (see the sampling sketch after this list).

  • Kinetics-400. We follow [82, 54] and sample the test action categories from: blasting sand, busking, cutting watermelon, dancing ballet, dancing charleston, dancing macarena, diving cliff, filling eyebrows, folding paper, hula hooping, hurling (sport), ice skating, paragliding, playing drums, playing monopoly, playing trumpet, pushing car, riding elephant, shearing sheep, side kick, stretching arm, tap dancing, throwing axe, unboxing.

  • UCF-101. Following [77], the test action categories are sampled from: blowingcandles, cleanandjerk, cliffdiving, cuttinginkitchen, diving, floorgymnastics, golfswing, handstandwalking, horserace, icedancing, jumprope, pommelhorse, punch, rockclimbingindoor, salsaspin, skiing, skydiving, stillrings, surfing, tennisswing, volleyballspiking.

  • HMDB-51. Following [77], the test action categories are sampled from: fencing, kick, kick ball, pick, pour, pushup, run, sit, smoke, talk.
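For concreteness, the sampling procedure can be sketched as below, assuming a mapping `videos_by_category` from category names to lists of video ids; the function and its arguments (`n_way`, `k_shot`, `num_trials`) are illustrative placeholders for the values of the public splits, not our released code.

```python
import random

def sample_support_set(videos_by_category, n_way, k_shot, num_trials, seed=0):
    """Draw `num_trials` random N-way-K-shot support sets from the test categories.

    `videos_by_category` maps a category name to its list of video ids; the concrete
    values of N, K and the number of trials follow the protocol described above.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        # sample N categories, then K videos per sampled category
        categories = rng.sample(sorted(videos_by_category), n_way)
        support = {c: rng.sample(videos_by_category[c], k_shot) for c in categories}
        trials.append(support)
    return trials
```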

A.1.2 -Shot--Way Setting

In this generalised few-shot setting, to construct the dataset for training, we sample videos from all categories and measure the performance on the corresponding standard testing set, i.e. all videos from all categories in the testing set. For statistical significance, we also conduct random sampling rounds to choose training videos.

  • Kinetics-400. The training set consists of videos, i.e.  videos, and the testing set covers videos.

  • UCF-101. The training set contains videos, i.e.  videos, and the testing set covers videos.

  • HMDB-51. The training data covers videos, i.e.  videos, and the testing set contains videos.

A.2 Open-set Action Recognition

In this section, we split the K-700 dataset into two subsets with disjoint classes. Specifically, 400 action categories are used for training, and the remaining 300 action categories are used for evaluation.

  • Training Categories (#400): carving wood with a knife, cracking neck, feeding goats, fixing bicycle, passing soccer ball, being in zero gravity, breaking boards, changing gear in car, playing organ, taking photo, finger snapping, walking on stilts, cleaning shoes, hoverboarding, putting wallpaper on wall, using atm, rock scissors paper, riding elephant, running on treadmill, cracking back, pulling rope (game), washing feet, skydiving, country line dancing, throwing knife, square dancing, fixing hair, folding clothes, doing jigsaw puzzle, making slime, using a power drill, welding, jumping jacks, cosplaying, surveying, bottling, smoking pipe, shooting basketball, swimming with dolphins, tying bow tie, cleaning gutters, playing cards, playing dominoes, uncorking champagne, drop kicking, folding paper, standing on hands, massaging neck, swing dancing, chopping meat, breading or breadcrumbing, laying concrete, driving car, sawing wood, clean and jerk, embroidering, pinching, playing saxophone, tango dancing, peeling banana, drumming fingers, throwing axe, lawn mower racing, roller skating, celebrating, dyeing eyebrows, arm wrestling, belly dancing, using segway, playing cello, news anchoring, mountain climber (exercise), treating wood, riding mechanical bull, cutting watermelon, playing laser tag, picking apples, using a sledge hammer, skipping rope, feeding fish, playing basketball, carving pumpkin, bee keeping, holding snake, walking through snow, fly tying, tightrope walking, playing monopoly, shopping, planing wood, brushing floor, cleaning pool, spinning poi, grooming horse, laughing, sign language interpreting, roasting pig, making cheese, ripping paper, decorating the christmas tree, spraying, snowkiting, putting on shoes, playing cricket, ironing, mosh pit dancing, swimming butterfly stroke, ironing hair, making the bed, chiseling stone, javelin throw, playing keyboard, poaching eggs, playing recorder, blowing nose, high kick, shot put, tasting beer, laying tiles, making paper aeroplanes, being excited, parkour, playing piano, throwing discus, wading through mud, washing dishes, headbutting, tying knot (not on a tie), unloading truck, visiting the zoo, picking blueberries, gymnastics tumbling, playing checkers, hugging baby, playing netball, spray painting, attending conference, playing trombone, using bagging machine, listening with headphones, making sushi, trimming or shaving beard, swimming with sharks, throwing water balloon, plastering, playing pan pipes, directing traffic, assembling computer, making horseshoes, ice swimming, pull ups, battle rope training, blowdrying hair, doing laundry, ice skating, shouting, surfing water, barbequing, vacuuming floor, squat, dribbling basketball, chasing, throwing ball (not baseball or American football), eating doughnuts, contact juggling, deadlifting, dancing gangnam style, pretending to be a statue, shaving head, putting on eyeliner, blowing bubble gum, jumping into pool, juggling fire, grinding meat, moving furniture, tagging graffiti, skiing mono, bookbinding, walking the dog, petting animal (not cat), falling off bike, scrambling eggs, sipping cup, separating eggs, historical reenactment, springboard diving, eating watermelon, card throwing, using a microscope, playing poker, making pizza, assembling bicycle, backflip (human), seasoning food, getting a tattoo, shining shoes, snatch weight lifting, installing carpet, getting a haircut, laying decking, rock climbing, sieving, rope pushdown, opening bottle (not wine), salsa dancing, 
catching or throwing baseball, texting, clapping, mopping floor, pirouetting, scuba diving, coughing, climbing a rope, changing oil, yarn spinning, playing guitar, using a paint roller, snowmobiling, tying necktie, vacuuming car, petting horse, busking, paragliding, playing kickball, chewing gum, giving or receiving award, drooling, putting in contact lenses, alligator wrestling, doing aerobics, whistling, somersaulting, carrying baby, decoupage, slicing onion, jetskiing, carving ice, baking cookies, checking watch, rolling pastry, pumping fist, crocheting, eating burger, jumping sofa, dodgeball, karaoke, waxing back, leatherworking, passing American football (not in game), massaging feet, dumpster diving, making balloon shapes, cracking knuckles, eating spaghetti, catching or throwing frisbee, drinking shots, playing gong, acting in play, shoveling snow, sharpening knives, using megaphone, doing nails, burping, inflating balloons, flying kite, herding cattle, doing sudoku, eating hotdog, putting on sari, punching bag, singing, squeezing orange, pushing cart, splashing water, playing trumpet, exercising arm, fencing (sport), ski jumping, lock picking, carrying weight, using inhaler, waking up, staring, photobombing, eating carrots, bungee jumping, checking tires, weaving fabric, home roasting coffee, playing didgeridoo, getting a piercing, building cabinet, jumping bicycle, capoeira, reading newspaper, playing rubiks cube, high jump, raising eyebrows, stretching arm, shooting off fireworks, dancing charleston, pillow fight, hockey stop, steering car, drawing, recording music, front raises, riding camel, wrapping present, waxing legs, sleeping, cooking scallops, sucking lolly, cutting cake, threading needle, base jumping, dining, trapezing, tackling, building shed, tiptoeing, cooking chicken, playing harmonica, training dog, setting table, curling eyelashes, passing American football (in game), docking boat, playing paintball, sneezing, playing with trains, swimming breast stroke, sticking tongue out, cutting pineapple, lunge, triple jump, marriage proposal, cleaning windows, diving cliff, bench pressing, making a cake, saluting, luge, driving tractor, swimming front crawl, bending back, laying stone, pushing car, sanding wood, dunking basketball, sanding floor, sausage making, robot dancing, building sandcastle, tasting food, spelunking, baby waking up, playing darts, playing american football, land sailing, sword fighting, ski ballet, playing mahjong, smelling feet, blasting sand, peeling potatoes, smoking, hurdling, grooming cat, pouring beer, bobsledding, flint knapping, washing hands, clay pottery making, digging, air drumming, moving child, fidgeting, packing, delivering mail, skipping stone, cartwheeling, playing bass guitar, tai chi, using remote controller (not gaming), playing pinball, bartending, waxing chest, parasailing, egg hunting, carving marble, wrestling, snowboarding, headbanging, playing hand clapping games, abseiling, crawling baby, skiing slalom, frying vegetables, wading through water.

  • Testing Categories (#300): adjusting glasses, answering questions, applauding, applying cream, archaeological excavation, archery, arguing, arranging flowers, arresting, auctioning, bandaging, bathing dog, beatboxing, bending metal, biking through snow, blending fruit, blowing glass, blowing leaves, blowing out candles, bodysurfing, bouncing ball (not juggling), bouncing on bouncy castle, bouncing on trampoline, bowling, braiding hair, breakdancing, breaking glass, breathing fire, brushing hair, brushing teeth, brush painting, building lego, bulldozing, calculating, calligraphy, canoeing or kayaking, capsizing, card stacking, casting fishing line, catching fish, catching or throwing softball, changing wheel (not on bike), cheerleading, chiseling wood, chopping wood, clam digging, cleaning toilet, climbing ladder, climbing tree, closing door, coloring in, combing hair, contorting, cooking egg, cooking on campfire, cooking sausages (not on barbeque), counting money, crossing eyes, crossing river, crying, cumbia, curling hair, curling (sport), cutting apple, cutting nails, cutting orange, dancing ballet, dancing macarena, dealing cards, disc golfing, dyeing hair, eating cake, eating chips, eating ice cream, eating nachos, entering church, exercising with an exercise ball, extinguishing fire, faceplanting, falling off chair, feeding birds, filling cake, filling eyebrows, flipping bottle, flipping pancake, folding napkins, gargling, geocaching, gold panning, golf chipping, golf driving, golf putting, gospel singing in church, grooming dog, hammer throw, hand washing clothes, head stand, helmet diving, high fiving, hitting baseball, hopscotch, huddling, hugging (not baby), hula hooping, hurling (sport), ice climbing, ice fishing, jaywalking, jogging, juggling balls, juggling soccer ball, jumpstyle dancing, kicking field goal, kicking soccer ball, kissing, kitesurfing, knitting, krumping, laying bricks, letting go of balloon, licking, lifting hat, lighting candle, lighting fire, longboarding, long jump, looking at phone, looking in mirror, making a sandwich, making bubbles, making jewelry, making latte art, making snowman, making tea, marching, massaging back, massaging legs, massaging person’s head, metal detecting, milking cow, milking goat, mixing colours, moon walking, motorcycling, moving baby, mowing lawn, mushroom foraging, needle felting, opening coconuts, opening door, opening present, opening refrigerator, opening wine bottle, peeling apples, person collecting garbage, petting cat, photocopying, planting trees, playing accordion, playing badminton, playing bagpipes, playing beer pong, playing billiards, playing blackjack, playing chess, playing clarinet, playing controller, playing cymbals, playing drums, playing field hockey, playing flute, playing harp, playing ice hockey, playing lute, playing maracas, playing marbles, playing nose flute, playing oboe, playing ocarina, playing piccolo, playing ping pong, playing polo, playing road hockey, playing rounders, playing scrabble, playing shuffleboard, playing slot machine, playing squash or racquetball, playing tennis, playing ukulele, playing violin, playing volleyball, playing xylophone, poking bellybutton, pole vault, polishing furniture, polishing metal, popping balloons, pouring milk, pouring wine, preparing salad, presenting weather forecast, pulling espresso shot, pumping gas, punching person (boxing), pushing wheelbarrow, pushing wheelchair, push up, putting on foundation, putting on lipstick, putting on mascara, reading book, 
repairing puncture, riding a bike, riding mule, riding or walking with horse, riding scooter, riding snow blower, riding unicycle, roasting marshmallows, rolling eyes, sailing, scrapbooking, scrubbing face, sewing, shaking hands, shaking head, shaping bread dough, sharpening pencil, shaving legs, shearing sheep, shining flashlight, shoot dance, shooting goal (soccer), shredding paper, shucking oysters, shuffling cards, shuffling feet, side kick, silent disco, situp, skateboarding, skiing crosscountry, slacklining, slapping, sled dog racing, smashing, smoking hookah, snorkeling, spinning plates, stacking cups, stacking dice, steer roping, stomping grapes, stretching leg, surfing crowd, sweeping floor, swimming backstroke, swinging baseball bat, swinging on something, sword swallowing, talking on cell phone, tap dancing, tapping guitar, tapping pen, tasting wine, testifying, throwing snowballs, throwing tantrum, tickling, tie dying, tobogganing, tossing coin, tossing salad, trimming shrubs, trimming trees, twiddling fingers, tying shoe laces, unboxing, using a wrench, using circular saw, using puppets, waiting in line, walking with crutches, washing hair, watching tv, watering plants, water skiing, water sliding, waving hand, waxing armpits, waxing eyebrows, weaving basket, windsurfing, winking, wood burning (art), writing, yawning, yoga, zumba.

A.3 Open-set Action Localisation

Here, we initiate two evaluation settings on the THUMOS14 and ActivityNet1.3 datasets: (A) train on 75% of the action categories and test on the remaining 25%; (B) train on 50% of the categories and test on the remaining 50%. For setting (A) on THUMOS14, the number of training categories is 15, and the number of testing categories is 5. For setting (B) on THUMOS14, the number of training categories is 10, and the number of testing categories is 10. For setting (A) on ActivityNet1.3, the number of training categories is 150, and the number of testing categories is 50. For setting (B) on ActivityNet1.3, the number of training categories is 100, and the number of testing categories is 100.

Under each setting, we conduct random samplings to split the action categories for training and testing. Note that, since the untrimmed videos in localisation are normally several minutes long, splitting the datasets by action category can lead to situations where the same video contains both training and testing categories. For such a multi-label video, we simply divide it into two videos, one containing only training categories and the other containing only testing categories, as sketched below.
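A minimal sketch of this splitting rule, assuming each video's annotations are given as (start, end, label) tuples; the function and argument names are illustrative.

```python
def split_multilabel_video(segments, train_classes, test_classes):
    """Split the annotations of one untrimmed video into a training copy and a testing copy.

    `segments` is a list of (start, end, label) tuples; each copy keeps only the action
    instances whose label falls into the corresponding class split.
    """
    train_video = [seg for seg in segments if seg[2] in train_classes]
    test_video = [seg for seg in segments if seg[2] in test_classes]
    return train_video, test_video
```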

Appendix B Implementation Details

In this paper, all prompt vectors and visual features are of the same dimension, , and the temperature hyper-parameter is set to . For action recognition and localisation (2nd-stage), we evaluate different numbers of prompt vectors, and eventually adopt the pattern of prepending / appending random vectors to the input text tokens, which are then optimised for the considered tasks. For text-video retrieval, as the text descriptions tend to be long, we use the 4+X+4 pattern (cf. Table 10). In terms of spatial pre-processing, we resize the frame’s short side to , while keeping its original aspect ratio, and then perform center cropping to convert the spatial size to .
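As a minimal sketch (not our exact implementation), the prompting mechanism can be written in PyTorch as follows, with `n_prefix`, `n_suffix`, and `dim` standing in for the counts and dimension quoted above.

```python
import torch
import torch.nn as nn

class PromptedText(nn.Module):
    """Prepend / append learnable prompt vectors to the text token embeddings."""

    def __init__(self, n_prefix, n_suffix, dim):
        super().__init__()
        # continuous prompt vectors, randomly initialised and optimised for the task
        self.prefix = nn.Parameter(0.02 * torch.randn(n_prefix, dim))
        self.suffix = nn.Parameter(0.02 * torch.randn(n_suffix, dim))

    def forward(self, token_embeddings):  # (batch, n_tokens, dim)
        b = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(b, -1, -1)
        suffix = self.suffix.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([prefix, token_embeddings, suffix], dim=1)
```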

B.1 Action Recognition

For action recognition (and text-video retrieval), at inference time, we randomly sample frames from each video several times, and take the average of these results as the final prediction, i.e. multi-crop evaluation.
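A sketch of this evaluation, assuming `model` maps a clip of frames to class logits; `num_frames` and `num_crops` stand for the (unspecified) values above.

```python
import torch

@torch.no_grad()
def multi_crop_predict(model, video_frames, num_frames, num_crops):
    """Average the predictions over several random frame samplings of one video.

    `video_frames` holds all decoded frames of the video as a (T, C, H, W) tensor.
    """
    logits = []
    for _ in range(num_crops):
        # draw a random, temporally ordered subset of frames
        idx = torch.randperm(video_frames.size(0))[:num_frames].sort().values
        logits.append(model(video_frames[idx]))
    return torch.stack(logits).mean(dim=0)
```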

B.2 Action Localisation

For action localisation, to obtain class-agnostic action proposals, we adopt the off-the-shelf proposal detectors [37, 74]. Specifically, we first divide the entire video into equal-frame snippets; use the CLIP image encoder with one Transformer layer to extract frame-wise embeddings; feed these embeddings to the -layer feature pyramid; use parallel prediction heads to determine the actionness, centerness, and boundaries respectively; and finally assemble all predictions and apply Soft-NMS [4] to suppress redundant proposals. On THUMOS14, we downsample each video to fps, and frames are used to construct each snippet; as for ActivityNet1.3, we maintain the original video frame rate and use frames in each snippet. The proposal detector is optimised using AdamW [44] with a learning rate of , and a batch size of videos. The optimisation objective is a combination of boundary regression and two binary classifications, i.e. actionness and centerness. For post-processing, we set the tIoU threshold in Soft-NMS to on THUMOS14, and on ActivityNet1.3.
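For reference, a simplified Gaussian variant of Soft-NMS over temporal proposals is sketched below; it is an illustrative re-implementation rather than the exact code of [4], and `sigma` / `score_thresh` are placeholder defaults.

```python
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS over 1D temporal proposals.

    `segments` is an (N, 2) array of (start, end) times, `scores` an (N,) array of
    proposal confidences; higher-scoring proposals decay the scores of overlapping ones.
    """
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep_segments, keep_scores = [], []
    while len(scores) > 0:
        i = int(scores.argmax())
        keep_segments.append(segments[i])
        keep_scores.append(float(scores[i]))
        segments, scores = np.delete(segments, i, axis=0), np.delete(scores, i)
        if len(scores) == 0:
            break
        # temporal IoU between the picked segment and the remaining proposals
        s, e = keep_segments[-1]
        inter = np.maximum(0.0, np.minimum(e, segments[:, 1]) - np.maximum(s, segments[:, 0]))
        union = (e - s) + (segments[:, 1] - segments[:, 0]) - inter
        tiou = inter / np.maximum(union, 1e-8)
        scores = scores * np.exp(-(tiou ** 2) / sigma)  # Gaussian score decay
        keep_mask = scores > score_thresh
        segments, scores = segments[keep_mask], scores[keep_mask]
    return np.array(keep_segments), np.array(keep_scores)
```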

B.3 Text-Video Retrieval

For text-video retrieval, the videos are decoded at fps, and we take the -frame input with a random frame gap; that is to say, the video is equivalently sampled at - fps. Note that we adopt a significantly lower fps here than for action recognition, as the retrieval task tends to require long-term visual dependencies.
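A sketch of this sampling strategy, where `num_samples` and `gap_range` are placeholders for the (unspecified) frame count and gap interval above.

```python
import random

def sample_frame_indices(num_video_frames, num_samples, gap_range):
    """Sample temporally ordered frame indices with a random gap between consecutive frames."""
    gap = random.randint(*gap_range)                      # gap_range = (low, high), inclusive
    max_start = max(0, num_video_frames - gap * num_samples)
    start = random.randint(0, max_start)
    return [min(start + i * gap, num_video_frames - 1) for i in range(num_samples)]
```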

Appendix C Experimental Results

In this section, we provide more experimental results to further analyse our model.

C.1 Action Localisation

We adopt the two-stage paradigm to achieve action localisation, i.e. first-stage proposal detection and second-stage proposal classification. In this section, we separately evaluate the performance of these two stages in the closed-set and open-set scenarios, to dissect localisation results.

C.1.1 Proposal Detection

To evaluate class-agnostic action proposals, we adopt the conventional metric of Average Recall at different Average Numbers of proposals, i.e. AR@AN. On THUMOS14, AR is computed over multiple tIoU thresholds from 0.5 to 1.0 with a stride of 0.05; on ActivityNet1.3, the tIoU thresholds range from 0.5 to 0.95 with a stride of 0.05. For the open-set settings with multiple sampling trials, we report the AR averaged over all trials.
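For clarity, a simplified computation of AR@AN is sketched below, keeping the top-AN score-sorted proposals of each video; it is an illustrative variant, not the official evaluation toolkit.

```python
import numpy as np

def tiou_matrix(proposals, gts):
    """Pairwise temporal IoU between proposals (P, 2) and ground-truth segments (G, 2)."""
    inter = np.maximum(0.0, np.minimum(proposals[:, None, 1], gts[None, :, 1])
                       - np.maximum(proposals[:, None, 0], gts[None, :, 0]))
    union = ((proposals[:, None, 1] - proposals[:, None, 0])
             + (gts[None, :, 1] - gts[None, :, 0]) - inter)
    return inter / np.maximum(union, 1e-8)

def average_recall_at_an(proposals_per_video, gts_per_video, an, thresholds):
    """AR@AN: recall averaged over tIoU thresholds, keeping the top-`an` proposals per video.

    Proposals are assumed to be sorted by confidence; a ground-truth instance counts as
    recalled if at least one kept proposal overlaps it with tIoU >= threshold.
    """
    recalls = []
    for thr in thresholds:
        matched, total = 0, 0
        for props, gts in zip(proposals_per_video, gts_per_video):
            props, gts = np.asarray(props, dtype=float), np.asarray(gts, dtype=float)
            total += len(gts)
            if len(props) == 0 or len(gts) == 0:
                continue
            tiou = tiou_matrix(props[:an], gts)
            matched += int((tiou.max(axis=0) >= thr).sum())
        recalls.append(matched / max(total, 1))
    return float(np.mean(recalls))
```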

Table 8 shows the comparison results. On both datasets, the performance in the open-set scenario decreases compared with the closed-set scenario, showing that the action proposals are in fact not perfectly class-agnostic; they remain biased towards seen action categories. Moreover, since each video in THUMOS14 contains much denser action instances, many times more than in ActivityNet1.3, the performance drop on THUMOS14 is more significant.

Scenario      Training Ratio    THUMOS14 AR@50    THUMOS14 AR@100    ActivityNet1.3 AR@100
Closed-set    100%              32.4              38.3               63.6
Open-set      75%               24.1              29.7               60.8
Open-set      50%               21.2              26.2               59.3

Table 8: Results of proposal detection. In the closed-set scenario, we train and evaluate on the same action categories. In the open-set scenario, we experiment with two settings: training with 75% (50%) of the action categories and testing on the remaining 25% (50%).

C.1.2 Proposal Classification

We eliminate the action proposals that are completely disjoint from all ground-truth action instances, and evaluate the standard TOP1 classification accuracy on the remaining action proposals.
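A sketch of this protocol is given below; matching each kept proposal to its most-overlapping ground-truth instance is our assumption for illustration.

```python
import numpy as np

def proposal_classification_accuracy(proposals, pred_labels, gt_segments, gt_labels):
    """TOP1 accuracy over the proposals that overlap at least one ground-truth instance.

    A proposal completely disjoint from every ground-truth segment is discarded; a kept
    proposal is correct if its predicted label matches that of the most-overlapping instance.
    """
    gt_segments = np.asarray(gt_segments, dtype=float)
    correct, kept = 0, 0
    for (start, end), pred in zip(proposals, pred_labels):
        inter = np.maximum(0.0, np.minimum(end, gt_segments[:, 1])
                           - np.maximum(start, gt_segments[:, 0]))
        if inter.max() <= 0:  # disjoint from all ground-truth instances
            continue
        kept += 1
        correct += int(pred == gt_labels[int(inter.argmax())])
    return correct / max(kept, 1)
```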

Table 9 shows the average accuracy over multiple sampling trials. Compared to the closed-set evaluation, the open-set classification accuracy tends to drop. Note that training with 75% of the action categories on THUMOS14 is a special case: since THUMOS14 has 20 action categories in total, the number of testing categories in this setting is only 5, thus the classification task is easier than in the closed-set scenario.

Scenario      Training Ratio    THUMOS14 train / test    TOP1    ActivityNet1.3 train / test    TOP1
Closed-set    100%              20 / 20                  88.7    200 / 200                      85.6
Open-set      75%               15 / 5                   93.4    150 / 50                       81.5
Open-set      50%               10 / 10                  87.3    100 / 100                      71.8

Table 9: Results of proposal classification. In the closed-set scenario of THUMOS14 (ActivityNet1.3), we train and test on the same 20 (200) categories. In the open-set scenario, we experiment with two settings, training with 75% (50%) of the action categories and testing on the remaining 25% (50%), e.g. for THUMOS14, training on 15 (10) categories and testing on the remaining 5 (10) categories.

C.1.3 Summary

Overall, the localisation performance drop in the open-set scenario comes from two sources: the recall drop of the first-stage action proposals, and the classification errors of the second stage.

C.2 Text-Video Retrieval

Here, we provide more comparison results on the retrieval benchmarks. Note that, although we compare against the results reported in existing work, these are by no means fair comparisons, as retrieval models are usually pre-trained on different datasets of varying sizes.

                                              MSRVTT (9K)              LSMDC                    SMIT
Method               Prompt   Temporal  E2E   R@1   R@5   R@10  MdR    R@1   R@5   R@10  MdR    R@1   R@5
CE [43]                                       21.7  51.8  65.7  5.0    12.4  28.5  37.9  21.7
MMT [18]                                      24.6  54.0  67.1  4.0    13.2  29.2  38.8  21.0
TT-CE+ [10]                                   29.6  61.6  74.2  3.0    17.2  36.5  46.3  13.7
SMiT-Baseline [50]                            33.1  64.8  77.4                                  39.5  65.7
MDMMT [12]                                    38.9  69.0  79.7  2.0    18.8  38.5  47.9  12.3
Baseline-IV                                   31.2  53.7  64.2  4.0    11.3  22.7  29.2  56.5   39.3  62.8
Ours                 4+X+4    2-TFM           36.7  64.6  76.8  2.0    13.4  29.5  40.3  18.6   66.6  87.8
Frozen [1]                                    31.0  59.5  70.5  3.0    15.0  30.8  39.8  20.0
CLIP4Clip [45]                                44.5  71.4  81.6  2.0    22.6  41.0  49.1  11.0

Table 10: Results of text-video retrieval. Baseline-IV refers to the original CLIP model with the text query naïvely encoded, i.e. without using any prompt. E2E denotes if the model has been trained end-to-end. We highlight the results that are not end-to-end finetuned, where the best and second-best results are highlighted with bold and underline respectively.

As can be observed in Table 10, our model with learnable prompt vectors largely outperforms Baseline-IV, i.e. the original CLIP model with the text query naïvely encoded, without using any prompt. Despite optimising only a few parameters, our results are sometimes comparable to the state-of-the-art methods that have been specifically designed for retrieval.