A Straightforward Framework For Video Retrieval Using CLIP

02/24/2021 ∙ by Jesús Andrés Portillo-Quintero, et al. ∙ Tecnológico de Monterrey 0

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Video is one of the most consumed forms of media available on the internet. The high consumption of this type of media requires to find suitable methods for finding videos that contain one or more features desired by the users. Most video browsers rely on annotations made by users to identify video contents. Although this solution is simple to implement, it comes at a high price. Relying on annotations to perform a query on videos requires an extensive description of the videos’ innards and context. Unfortunately, this information may not be made available. Thus, it is clear that a video retrieval system that can handle user’s queries without the need for such annotations represents a relevant topic of study.

This document describes a video retrieval model, which, as its name implies, can retrieve the videos from a collection that are best described by a particular query (text). For example, “A woman is running” should return videos that contain women running. Given that the Video Retrieval architecture estimates the similarity between video and text, it can also be used to perform the video-to-text retrieval (VTR) task. It consists of returning captions that best describe the query (video) from a set of description candidates. In either task, the system goal is that, given a query and a set of video-text pairs, it must return the ranking at which the corresponding opposite modality is positioned.

The TVR and VTR tasks can be seen as a method by which video and text contents are funnelled into a fixed-length representation using an embedding function. Since both projections fall in the same dimensional space, a similarity score can be applied, which consequently can be used to rank elements from a set of prospects. Given that similarity metrics between text-video and video-text are equal, TVR and VTR are considered inverse operations. They only depend on the modality of the input prompt.

Some works extensively focus on the video representation by adding pre-trained models considered “experts”. Each “expert” focuses on specific video contents such as sound, face detection, motion, among others. The information from all the experts is multiplexed by a complex gating mechanism 

[5, 7]. Instead of starting from an elaborated video representation to train a common visual-text space, we propose to use a learned visual-text space to build a video representation. Similarly to Mithun et al. [12], our approach consists of using pre-trained models that measure the similarity between image and text. Then, we extend this idea to handle videos. We experimented with several aggregation methods to comply with the extra temporal dimension.

In this work, we chose CLIP as the base image-text model. CLIP is a state-of-the-art Neural Network, which is pre-trained for image-text pairs 

[14]. CLIP has proved that similarity learning can be used to train a visual encoder for downstream tasks such as classification, captioning, and clustering, to mention some. We harness the power of its visual representations to create a video representation that can be used directly with its original text encoder to bootstrap a Neural Network model for Video Retrieval. Since our work focuses on aggregation strategies of image features, our method is tested with Zero Shots of the evaluation dataset. Hence, no parameter finetuning is exercised to improve retrieval results.

The remainder of this document is organized as follows. In Section 2 we provide the foundations of this investigation and an overview of the most relevant related works. Section 3 describes the experiments conducted, their main results and their discussion. Finally, in Section 4 we present the conclusion and some ideas that may be worth exploring as part of the future work.

2 Background and Related Work

The work presented in this document is related to strategies used to construct a video encoder for video retrieval. It is straight forward to think that image features can serve as a proxy for video representations. In fact, Karpathy et al. [6]

observed that a Convolutional Neural Network (CNN) feature from a single frame could be discriminative enough for video classification, achieving just 1.3 fewer percentage points than the top accuracy model from the same work, which on its part included more visual and temporal information.

Mithun et al. [12] proved that it was possible to supersede the state-of-the-art video retrieval model by obtaining the average visual features obtained from an image-text model. This practice has been implemented on novel models, along with more elaborated video representations. For instance, the state-of-the-art in video retrieval has been pushed by models that implement a Mixture-of-Experts (MoE) paradigm [5, 7, 10, 13]. The MoE approach proposes a complex video representation by multiplexing the outputs of several pre-trained models (known as “experts”) that attend to particular aspects of video such as motion, face detection, character recognition, among others.

In this regard, we are aware that at most seven experts have been included in a Video Retrieval model [5]. Nonetheless, the current state-of-the-art implements a mixture of two experts, indicating that video-text representations may rescind the added complexity that multiple experts convey [13]. Patrick et al. [13] propose that Contrastive Training used by most video retrieval systems encourages repulsive forces on independent, but similar, examples. To alleviate this, they use a support set containing positive examples for each data point on a training batch, so the common video-text space must learn concept sharing. Nonetheless, contrastive training has been proved successful in image and video representation learning [2, 9].

Contrastive training is a regime on which a model is inducted to pull similar data points together and push apart dissimilar ones on a latent space. The foundational mechanism of the Contrastive Language-Image Pretraining (CLIP) is the model used in this work. As its name states, the model is pre-trained on 400,000 image-text pairs collected from the Internet. As a siamese neural network, it is composed of an image (ViT-B/32) and text encoder (transformer) that funnel information into a common space where objects can be compared using cosine similarity 


3 Experiment and Results

This section describes a mathematical description of CLIP and how we can use it for VTR or TVR. Then, we describe the datasets and metrics considered for this work. Then, we detail the experiments and their main results, followed by a brief discussion of the most relevant findings.

3.1 CLIP as Video Representation

By using CLIP we obtain the pre-trained functions and , which encode image and text into respectively. Assume a video is composed of sampled frames such that . Consequently, we can calculate the embedding of each frame into a matrix so . Therefore, the problem we try to solve is to find an aggregation function that maps the input into a video representation . Then, with a video and text representations and , we can compute a cosine similarity function (Equation 1), which is useful for ranking the video-text pairs inside a dataset given a query of a specific modality.


3.2 Datasets

The proposed framework assumes a set of videos and corresponding captions pairs in the form where the number of captions per video may be non-uniform, hence is a function of . By design, some datasets are split into sections used for training and validation of results. For the preliminary experiments, we use the training splits to prove our hypothesis, but final results are reported on tests split of their respective datasets.

The datasets involved in this work are listed below.


is a dataset composed of 10,000 videos, each with a length that ranges from ten to 32 seconds and 200,000 captions. The training, validation and test splits are composed of 6,513, 497 and 2,990 videos, respectively, with 20 corresponding descriptions each [18]. The test set has been used in different ways in the literature. Then, we will refer to two common variations as Full [7] (containing all the 2,990 videos in the test set from MSR-VTT) and 1k-A [19] (containing only 1000 videos from the 2,990 in the test set in MSR-VTT).


contains 1970 videos, each with a length that ranges from one to 62 seconds. Train, validation and test splits contain 1200, 100 and 670 videos, respectively [1]. Each video has approximately 40 associated sentences in English.


is comprised 118,081 videos, each with a length that ranges from two to 30 seconds. The videos were extracted from 202 movies. The validation set contains 7,408 videos, and the test set 1,000 videos from movies independent from the training and validation splits [15].

All the frames were sampled from each video from the previously mentioned datasets to extract the frame features. Other datasets are related to this work but cannot be used include WIT (WebImageText) [14] and HT100M [11]. WIT is composed of 400,000 image-text pairs on which CLIP was trained on. Since WIT is an image-text dataset that cannot be used as a benchmark for video retrieval. HT100M is a dataset of 100 million video-text pairs, used only as a pre-training set for other Video Retrieval works [5, 11, 13, 16].

3.3 Metrics

To conduct our experiments, we follow the testing methodologies used in previous works [5, 7]

and report standard retrieval metrics. For median rank (MdR), mean rank (MnR) and standard deviation of rank (StdR), the lower the value, the better the performance. In the case of recall at rank (

, where ), the higher the value, the better the performance. For datasets that involve multiple sentences per video —such as Full from MSR-VTT and MSVD test—, we follow the protocol used by Liu et al. [7] and use the minimum rank among all associated sentences to a given video query.

3.4 Exploratory Experiments

In the exploratory experiments, we empirically define two candidates for frame-level aggregation functions. We conduct this set of preliminary experiments on a validation sample comprised of 1,000 video-text pairs from MSR-VTT. The first frame-level aggregation function is based on the idea that it is feasible to obtain reasonable video representations by only considering one frame sample [6]. Given the feature matrix , we define as a function that returns the features of the 30th frame. Since these videos contain approximately 30 frames per second, this is equivalent to sampling a frame from the first second of the video.

A second candidate for an aggregation function is proposed by Mithun et al. [12], who suggest that the average of frame-level features from videos can be used as an approximator for video representations. This method has extensively been used in other retrieval works [5, 7, 9, 11, 13]. Consequently, we define , where is the average value of matrix columns.

Given that videos present dynamic events in which several sequences of frames can represent completely different things, we used -means as the method for aggregation [17]. With this implementation, the aggregation function follows the form , which returns video embeddings. For evaluation purposes, we repeat the ranking procedure with the obtained independent video representations times and register each query’s minimum rank, then calculate the retrieval metrics.

R@1 R@5 R@10 MdR MnR StdR
24.9 46.1 56.9 7.0 64.61 149.21
35.4 58.0 67.2 3.0 39.81 111.43
34.3 57.8 66.5 3.0 40.23 112.85
34.4 57.7 66.6 3.0 39.77 110.69
33.7 58.4 66.9 3.0 37.98 107.53
34.4 57.6 66.1 3.0 38.44 108.02
34.9 58.4 67.6 3.5 37.44 108.34
35.3 58.1 67.5 4.0 38.33 107.88
33.9 57.7 67.9 3.0 38.23 107.32
33.9 57.2 67.1 3.0 37.87 108.23
35.0 57.8 68.0 3.0 37.26 107.34
Table 1: Text-to-Video Retrieval results on the MSR-VTT validation set, using different aggregation functions.

Based on the results depicted in Table 1, the average-based methods obtain the best results in terms of the metrics used. It is noticeable that, among -means methods, there is no significant difference between the results. This may be because MSR videos do not exceed 32 seconds in length, which may not be enough to differentiate the centroids when creating the clusters. We appeal to Occam’s Razor principle regarding the aggregation method and select for further experiments since it accomplishes a similar performance to -means based aggregation methods but with a lower calculation complexity.

3.5 Confirmatory Experiments

This section compares our video retrieval model against the state-of-the-art in the MSR-VTT, MSVD and LSMDC datasets. In all the cases, we evaluate both the TVR and VTR tasks.

In MSR-VTT, we are able to supersede the R@1 score of the previous best model SSB [13] on the split 1k-A for the TVR task. However, we are positioned behind previous works on other recall metrics (Table 2). Besides, we consistently achieve state-of-the-art results on all the recall metrics in the Full split from MSR-VTT. In the MSVD dataset, we obtain state-of-the-art results on most of the retrieval metrics (Table 3). We suppose that models that are based on MoE such as SSB [13] and CE [7] cannot use all of their implemented experts because the videos in MSVD lack audio information, so they have to rely only on visual features. In LSMDC, we do not obtain state-of-the-art results, but we are positioned second-best (Table 4). Given that video descriptions in this dataset do not follow the form of a typical sentence, as they are designed to teach a model to recognize characters and interactions between movie scenes, we commend the robustness of CLIP’s text encoder because it could adapt to a new sentence schema.

Method Training Test Set R@1 R@5 R@10 MdR R@1 R@5 R@10 MdR
JSFusion [19] M 1k-A 10.2 31.2 43.2 13 - - - -
HT100M [11] H+M 1k-A 14.9 40.2 52.8 9 16.8 41.7 55.1 8
CE [7] M 1k-A 20.9 48.8 62.4 6 20.6 50.3 64 5.3
AVLnet [16] H+M 1k-A 27.1 55.6 66.6 4 28.5 54.6 65.2 4
MMT [5] H+M 1k-A 26.6 57.1 69.6 4 27.0 57.5 69.7 3.7
SSB [13] H+M 1k-A 30.1 58.5 69.3 3 28.5 58.6 71.6 3
CLIP W 1k-A 31.2 53.7 64.2 4 27.2 51.7 62.6 5
VSE [12] M Full 5.0 16.4 24.6 47 7.7 20.3 31.2 28
VSE++ [12] M Full 5.7 17.1 24.8 65 10.2 25.4 35.1 25
Multi Cues [12] M Full 7.0 20.9 29.7 38 12.50 32.10 42.4 16
W2VV [3] M Full 6.1 18.7 27.5 45 11.8 28.9 39.1 21
Dual Enc. [4] M Full 7.7 22.0 31.8 32 13.0 30.8 43.3 15
E2E [9] M Full 9.9 24.0 32.4 29.5 - - - -
CE [7] M Full 10.0 29.0 42.2 16 15.6 40.9 55.2 8.3
CLIP W Full 21.4 41.1 50.4 10 40.3 69.7 79.2 2
Table 2: TVR and VTR results in the MSR-VTT dataset. M, H and W denote training on MSR-VTT, HT100M and WIT, respectively.
Method Training R@1 R@5 R@10 MdR R@1 R@5 R@10 MdR
VSE [12] D 12.3 30.1 42.3 14 34.7 59.9 70.0 3
VSE++ [12] D 15.4 39.6 53.0 9 - - - -
Multi Cues [12] D 20.3 47.8 61.1 6 - - - -
CE [7] D 19.8 49.0 63.8 6 - - - -
Support-set Bottleneck [13] H+D 28.4 60.0 72.9 4 - - - -
CLIP W 34.2 60.1 70.9 3 56.4 81.5 90.2 1
Table 3: TVR and VTR results in the MSVD dataset. D, H and W denote training on MSVD, HT100M and WIT, respectively.
Method Training R@1 R@5 R@10 MdR R@1 R@5 R@10 MdR
JSFusion [19] L 9.1 21.2 34.1 36 12.3 28.6 38.9 20
CE [7] L 11.2 26.9 34.8 25.3 - - - -
MMT [5] H+L 12.9 29.9 40.1 19.3 - - - -
CLIP W 11.3 22.7 29.2 56.5 6.8 16.4 22.1 73
Table 4: TVR and VTR results in the LSMDC dataset. L, H and W denote training on LSMDC, HT100M and WIT, respectively.

3.6 Discussion

Although we obtain outstanding results on different metrics and datasets, there are some things worth discussing. For example, our original supposition was that the ranking worsens as the video gets longer. To confirm or reject this idea, we produced Figure 1. Figure 1 depicts the video length in seconds (-axis), and the rank assigned to it (-axis). As a video gets longer, we expected that it would be more difficult for the video representation to capture the temporal elements. Hence it would be ranked worse. However, the experiment conducted on set 1k-A from MSR-VTT shows that ranking varies wildly independently from video length (at least as long as the videos present in the dataset).

Figure 1: Scatter plot of video length and assigned rank on TVR task on the 1k-A test split. The red line represents the median rank.

We proceeded to look at the worst ranked video-text pairs, we noticed that several sentences incorporated phrases like “a family is having a conversation” or “a man talking about a woman” hinting that sentences that were mainly describing audio content would be ranked worse. This conclusion is reinforced by the fact that our model scored the best on MSVD, a dataset that by design does not contain any audio track and text descriptions are based on what can be visualized.

4 Conclusion and Future Work

This work presents the first implementation of CLIP to obtain video features. Our method works by leveraging its learned common image-text space without the need for parameter finetuning (Zero-Shot). We apply an aggregation function to frame-level features, common in other video retrieval works. Our work focuses only on visual and text modalities, as it supersedes methods that implement a complex mixture of pre-trained models obtaining state-of-the-art results on the MSVD and MSR-VTT datasets.

One potential application of this CLIP-derived implementation is to retrieve specific moments inside videos. Also, it is yet unseen how will our video representation behave if tested as a video classifier. This methodology might be useful to create a video representation that is based on CLIP for longer durations. For example, other works have used frame features to construct a graph that can change through time 

[8]. Such a representation could keep the strong text alignment suitable for video retrieval. Also, our work can be used as an expert on a future MoE video retrieval system.


This research was partially supported by ITESM Research Group with Strategic Focus on Intelligent Systems.


  • [1] D. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 190–200. Cited by: item MSVD.
  • [2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In

    International conference on machine learning

    pp. 1597–1607. Cited by: §2.
  • [3] J. Dong, X. Li, and C. G. Snoek (2018) Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia 20 (12), pp. 3377–3388. Cited by: Table 2.
  • [4] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang (2019) Dual encoding for zero-example video retrieval. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    pp. 9346–9355. Cited by: Table 2.
  • [5] V. Gabeur, C. Sun, K. Alahari, and C. Schmid (2020) Multi-modal transformer for video retrieval. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 214–229. External Links: ISBN 978-3-030-58548-8 Cited by: §1, §2, §2, §3.2, §3.3, §3.4, Table 2, Table 4.
  • [6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2, §3.4.
  • [7] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman (2020) Use what you have: video retrieval using representations from collaborative experts. External Links: 1907.13487 Cited by: §1, §2, item MSR-VTT, §3.3, §3.4, §3.5, Table 2, Table 3, Table 4.
  • [8] F. Mao, X. Wu, H. Xue, and R. Zhang (2018) Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §4.
  • [9] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: §2, §3.4, Table 2.
  • [10] A. Miech, I. Laptev, and J. Sivic (2018) Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv:1804.02516. Cited by: §2.
  • [11] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2630–2640. Cited by: §3.2, §3.4, Table 2.
  • [12] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19–27. Cited by: §1, §2, §3.4, Table 2, Table 3.
  • [13] M. Patrick, P. Huang, Y. Asano, F. Metze, A. G. Hauptmann, J. F. Henriques, and A. Vedaldi (2021) Support-set bottlenecks for video-text representation learning. In International Conference on Learning Representations, External Links: Link Cited by: §2, §2, §3.2, §3.4, §3.5, Table 2, Table 3.
  • [14] Radford,Alec, K. Wook,Jong, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. Cited by: §1, §2, §3.2.
  • [15] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele (2015) A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3202–3212. Cited by: item LSMDC.
  • [16] A. Rouditchenko, A. Boggust, D. Harwath, D. Joshi, S. Thomas, K. Audhkhasi, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, and J. Glass (2020) AVLnet: learning audio-visual language representations from instructional videos. External Links: 2006.09199 Cited by: §3.2, Table 2.
  • [17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473. Cited by: §3.4.
  • [18] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296. Cited by: item MSR-VTT.
  • [19] Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 471–487. Cited by: item MSR-VTT, Table 2, Table 4.