
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

07/15/2022
by   Yiwei Ma, et al.

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation, thus improving the accuracy of retrieval. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. However, another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices into instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art on these benchmarks (e.g., +6.3% relative R@1 improvement on MSR-VTT), demonstrating the superiority of multi-grained contrast and AOSM.


1. Introduction

Video-text retrieval (VTR) is a multi-modal task, which aims to find the most relevant video/text based on the text/video query. With the explosive growth of videos on the Internet, VTR has attracted increasing interest and plays an important role in people’s daily life. Recent years have witnessed the rapid development of VTR, supported by a series of multi-modal pre-training models (Radford et al., 2021; Lei et al., 2021; Bain et al., 2021), innovative retrieval methods (Yu et al., 2018; Zhu and Yang, 2020; Lei et al., 2021; Liu et al., 2019; Gabeur et al., 2020; Dzabraev et al., 2021; Mithun et al., 2018; Zhang et al., 2018; Liu et al., 2021; Dong et al., 2019; Bertasius et al., 2021; Arnab et al., 2021; Luo et al., 2021; Jin et al., 2021; Yang et al., 2020; Wang et al., 2021) and video-text benchmarks (Caba Heilbron et al., 2015; Xu et al., 2016; Chen and Dolan, 2011; Rohrbach et al., 2015; Anne Hendricks et al., 2017).

Recently, with the great success of large-scale contrastive language-image pre-training, VTR has also made great progress. Specifically, trained on 400M image-text pairs, CLIP (Radford et al., 2021) embeds images and sentences into a shared semantic space for similarity calculation. Furthermore, CLIP4Clip (Luo et al., 2021) transfers the image-text knowledge of CLIP to the VTR task, resulting in significant performance improvements on several video-text retrieval datasets. However, CLIP and CLIP4Clip embed the whole sentence and image/video into textual and visual representations, and thus lack the ability to capture fine-grained interactions. To this end, some previous works (Yao et al., 2021; Lee et al., 2018) propose fine-grained contrastive frameworks, which consider the contrast between each word of the sentence and each frame of the video. Moreover, TACo (Yang et al., 2021) introduces token-level and sentence-level losses to consider both fine-grained and coarse-grained contrast. Although these methods have shown promising advances on the VTR task, cross-modality semantic contrast still needs to be systematically explored.

Figure 1. X-CLIP aims to improve video-text retrieval performance via multi-grained contrastive learning, including fine-grained (frame-word), coarse-grained (video-sentence) and cross-grained (video-word, sentence-frame) contrast. The transparency of words and frames represents their degree of relevance to the query.

As shown in Fig. 1, a video is composed of multiple frames, and a sentence consists of several words. Videos and sentences are usually redundant and may contain unnecessary frames or unimportant words. Concretely, given a specific video or sentence query, unnecessary frames or unimportant words refer to the candidates with low relevance to the query (i.e., light-colored frames and words in Fig. 1). However, most current works mainly focus on coarse-grained contrast (Radford et al., 2021; Luo et al., 2021), fine-grained contrast (Yao et al., 2021; Lee et al., 2018) or both (Yang et al., 2021), which are ineffective at filtering out these unnecessary frames and words. Specifically, coarse-grained contrast calculates the similarity between video-level and sentence-level features, and fine-grained contrast calculates the similarity between frame-level and word-level features. To this end, we ask: how can unnecessary information be effectively filtered out during retrieval? To answer this question, we propose cross-grained contrast, which calculates the similarity score between the coarse-grained features and each fine-grained feature. As shown in Fig. 1, with the help of the coarse-grained feature, unimportant fine-grained features will be filtered out and important fine-grained features will be up-weighted. However, a further challenge of cross-grained contrast lies in aggregating the similarity matrices into instance-level similarity scores. A naive method is to use the Mean-Max strategy (Yao et al., 2021; Khattab and Zaharia, 2020; Santhanam et al., 2021; Khattab et al., 2021) to calculate the instance-level similarity score after obtaining the similarity matrix. However, the conventional Mean-Max strategy is not conducive to filtering out the unnecessary information in videos and sentences during retrieval. On one hand, Mean applies the same weight to all frames and words, so the contrast between unnecessary frames and unimportant words may harm retrieval performance. On the other hand, Max only considers the most important frame and word, ignoring other critical frames and words.

Based on the above analysis, in this paper, we propose an end-to-end multi-grained contrastive model, namely X-CLIP, for video-text retrieval. Specifically, X-CLIP first adopts modality-specific encoders to generate multi-grained visual and textual representations, and then considers multi-grained contrasts of features (i.e., video-sentence, video-word, sentence-frame, and frame-word) to obtain multi-grained similarity scores, vectors, and matrices. To effectively filter out the unnecessary information and obtain meaningful instance-level similarity scores, the AOSM module of X-CLIP applies an attention mechanism over the similarity vectors/matrices. Different from the conventional Mean-Max strategy, our proposed AOSM module dynamically considers the importance of each frame in the video and each word in the sentence, so the adverse effects of unimportant words and unnecessary frames on retrieval performance are reduced.

To validate the effectiveness of our proposed X-CLIP, we conduct extensive experiments on five widely-used video-text retrieval benchmarks and achieve significantly better performance than previous approaches. Specifically, X-CLIP achieves 49.3 R@1 on MSR-VTT (i.e., 6.3% relative and 2.9% absolute improvement over the previous state-of-the-art approach). Besides, our proposed X-CLIP achieves 50.4 R@1, 26.1 R@1, 47.8 R@1, and 46.2 R@1 on the MSVD, LSMDC, DiDeMo and ActivityNet datasets, respectively, outperforming the previous SOTA methods by +6.6% (+3.1%), +11.1% (+2.6%), +6.7% (+3.0%), and +3.8% (+1.7%) in relative (absolute) improvement.

2. Related Works

2.1. Vision-Language Pre-Training

With the success of self-supervised pre-training such as BERT (Devlin et al., 2018) in NLP, vision-language pre-training on large-scale unlabeled cross-modal data has attracted growing attention (Lu et al., 2019; Xu et al., 2021; Tan and Bansal, 2019; Li et al., 2020b; Yu et al., 2020; Li et al., 2021; Radford et al., 2021; Jia et al., 2021; Sun et al., 2019b; Li et al., 2020a). One line of work, such as LXMERT (Tan and Bansal, 2019), OSCAR (Li et al., 2020b) and ALBEF (Li et al., 2021), focuses on pre-training on enormous numbers of image-text pairs and obtains significant improvements on a variety of vision-and-language tasks. To better cope with image-text retrieval tasks, contrastive language-image pre-training methods such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021) and WenLan (Huo et al., 2021) have been proposed, leveraging billion-scale image-text pairs from the web with a dual-stream Transformer. Due to the great advantage of CLIP for visual representation learning, recent work such as CLIP4Clip (Luo et al., 2021) has begun to transfer the knowledge of CLIP to video-text retrieval tasks and has obtained new state-of-the-art results. The other line of work, such as VideoBERT (Sun et al., 2019b), HERO (Li et al., 2020a) and Frozen in Time (Bain et al., 2021), directly collects video-text pairs for video-language pre-training, further considering the temporal information in videos. However, the scale of video-language pre-training datasets is much smaller than that of image-text pre-training, since the process of video-text dataset collection is much more expensive. In this work, we follow the line of CLIP4Clip (Luo et al., 2021), which enhances video-text retrieval by borrowing the ability of visual representation learning from contrastive image-text pre-training. Different from CLIP4Clip (Luo et al., 2021), we design a multi-grained video-text alignment function to better align the video-text semantics.

2.2. Video-Text Retrieval

Video-text retrieval is a popular but challenging task, which involves cross-modal fusion of multiple modalities and additional understanding of temporal information in videos. Traditional video-text retrieval methods tend to design task-specific or modality-specific fusion strategies for cross-modal learning from offline-extracted video and text features (Yu et al., 2017; Gabeur et al., 2020; He et al., 2021; Liu et al., 2021; Patrick et al., 2020; Jang et al., 2017; Le et al., 2020), including features from face recognition, object recognition, and audio processing. However, they are limited by the pre-extracted single-modal features, since these features are not properly learnt for the target downstream tasks. Recently, the paradigm of end-to-end video-text retrieval by training models directly from raw video/text has gained large popularity. For example, MIL-NCE (Miech et al., 2020) adopts Multiple Instance Learning and Noise Contrastive Estimation for end-to-end video representation learning, which addresses visually misaligned narrations from uncurated videos. ClipBERT (Lei et al., 2021) proposes to sparsely sample video clips for end-to-end training to obtain clip-level predictions, while Frozen in Time (Bain et al., 2021) uniformly samples video frames and conducts end-to-end training on both image-text and video-text pairs. CLIP4Clip (Luo et al., 2021) transfers the knowledge of CLIP to end-to-end video-text retrieval and investigates three similarity calculation approaches for video-sentence contrastive learning. However, cross-grained (i.e., video-word and sentence-frame) contrast is also critical, and it has rarely been explored in previous works. We present the first work on multi-grained contrastive learning for end-to-end video-text retrieval, considering all of the video-sentence, video-word, sentence-frame, and frame-word contrasts.

2.3. Multi-Grained Contrastive Learning

Recently, contrastive learning (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b, 2021) has become a popular topic in the deep learning community. CLIP (Radford et al., 2021) implements the idea of contrastive learning based on a large number of image-text pairs, achieving outstanding performance on several multi-modal downstream tasks (Zhang et al., 2021; Ji et al., 2022; Ma et al., 2022; Zhu et al., 2022; Ji et al., 2021; He et al., 2022). To achieve fine-grained contrastive learning, FILIP (Yao et al., 2021) contrasts each patch in the image with each word in the sentence, achieving fine-grained semantic alignment. TACo (Yang et al., 2021) proposes token-level and sentence-level losses to include both fine-grained and coarse-grained contrasts. Although contrastive learning has been widely used in multi-modal pre-training, cross-grained contrast, which is also critical for semantic alignment, has rarely been explored in previous works. Therefore, we propose a multi-grained contrastive learning method for video-text retrieval, which aims to achieve multi-grained semantic alignment.

3. Methodology

In this section, we elaborate on each component of our proposed X-CLIP, whose architecture is shown in Fig. 2. Specifically, we first introduce how to extract the multi-grained visual and textual representations in Sec. 3.1. We then explain the multi-grained contrastive learning over these feature representations in Sec. 3.2, which aims to obtain multi-grained contrast scores, vectors, and matrices. We also introduce how to aggregate the similarity vectors/matrices into the instance-level similarity score in Sec. 3.3. Finally, we describe the similarity calculation and the objective function for video-text retrieval in Sec. 3.4 and Sec. 3.5, respectively.

Figure 2. Illustration of the proposed X-CLIP model. The input sentences are processed by the text encoder to generate coarse-grained and fine-grained textual representations. The input video is sampled into ordered frames, which are fed into the frame encoder to generate frame-level representations. The frame-level representations are then fed into the temporal encoder to capture the temporal relationships. The outputs of the temporal encoder are the fine-grained visual representations, and the coarse-grained visual representation is obtained by averaging all these fine-grained features. Based on these representations, we calculate the video-sentence, video-word, sentence-frame, and frame-word similarity scores.

3.1. Feature Representation

3.1.1. Frame-level Representation

For a video $v$, we first sample video frames at a rate of 1 frame per second (FPS). A frame encoder, which is a standard Vision Transformer (ViT) with 12 layers, processes these frames to obtain frame-level features. Following the previous work (Luo et al., 2021), we initialize our frame encoder with the public CLIP (Radford et al., 2021) checkpoints. The architecture of ViT is the same as the Transformer (Vaswani et al., 2017) encoder in natural language processing (NLP), except that ViT introduces a visual tokenization process to convert video frames into discrete token sequences. Each discrete token sequence, prepended with a [CLS] token, is then fed into the Transformer of ViT. The [CLS] tokens from the last layer are extracted as the frame-level features $Z = [z_1, z_2, \dots, z_n]$, where $n$ is the number of sampled frames.
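As a concrete illustration of this step, the sketch below samples frames at 1 FPS and keeps the per-frame [CLS] feature. The tensor layouts and the `frame_encoder` callable (e.g., the visual tower of a CLIP checkpoint) are illustrative assumptions rather than the paper's exact code.

```python
import torch

def sample_frames_1fps(frames: torch.Tensor, native_fps: float) -> torch.Tensor:
    """Keep roughly one frame per second from a decoded clip.

    frames: (T, 3, H, W) tensor of decoded frames at the clip's native frame rate.
    """
    step = max(int(round(native_fps)), 1)   # 1 FPS sampling
    return frames[::step]

@torch.no_grad()
def encode_frames(frame_encoder, frames: torch.Tensor) -> torch.Tensor:
    """Run a ViT-style frame encoder and return per-frame [CLS] features.

    frame_encoder is assumed to map (n, 3, H, W) pixels to (n, D) [CLS] features,
    as the visual tower of a CLIP checkpoint does.
    """
    return frame_encoder(frames)            # (n, D) frame-level features Z
```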

3.1.2. Visual Representation

However, the features $Z$ are extracted from separate frames, without considering the interaction among frames. Therefore, we further propose a temporal encoder with a temporal position embedding $P$, which is a set of predefined parameters, to model the temporal relationship. Specifically, the temporal encoder is also a standard Transformer with 3 layers, which can be formulated as:

$\tilde{V} = [\tilde{v}_1, \tilde{v}_2, \dots, \tilde{v}_n] = \mathrm{TemporalEnc}(Z + P), \qquad (1)$

where $\tilde{V}$ denotes the final frame-level (fine-grained) visual features of the video $v$, and $n$ is the number of frames in the video $v$. To obtain the video-level (coarse-grained) visual feature $\bar{v}$, all frame-level features of the video are averaged, which can be formulated as:

$\bar{v} = \frac{1}{n}\sum_{i=1}^{n}\tilde{v}_i. \qquad (2)$
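The following is a minimal PyTorch sketch of the temporal encoder and the mean pooling in Eqs. (1)-(2). The hyper-parameter defaults (feature dimension, number of heads, maximum number of frames) are illustrative assumptions, not necessarily the values used in the paper.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """3-layer Transformer over frame features with learned temporal positions."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_layers: int = 3, max_frames: int = 12):
        super().__init__()
        self.temporal_pos = nn.Parameter(torch.zeros(max_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (B, n, D) frame-level features from the frame encoder.
        n = frame_feats.size(1)
        x = frame_feats + self.temporal_pos[:n]   # Eq. (1): add temporal positions
        fine = self.encoder(x)                    # fine-grained visual features
        coarse = fine.mean(dim=1)                 # Eq. (2): video-level feature
        return fine, coarse
```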

3.1.3. Textual Representation

Given a sentence, we directly use the text encoder of CLIP to generate the textual representations, which is also initialized with the public checkpoints of CLIP (Radford et al., 2021). Specifically, it is a Transformer encoder consisting of multi-head self-attention and feed-forward networks, with 12 layers and 8 attention heads. The dimension of the query, key, and value features is 512. The tokenizer used in the experiments is a lower-cased byte pair encoding (BPE) (Sennrich et al., 2015) with a 49,152 vocabulary size. Before being fed into the text encoder, the textual token sequence is padded with [BOS] and [EOS] tokens at the beginning and end, respectively. The sentence-level (coarse-grained) textual feature $\bar{t}$ and the word-level (fine-grained) textual features $\tilde{T} = [\tilde{t}_1, \tilde{t}_2, \dots, \tilde{t}_m]$ are the outputs of the [EOS] token and the corresponding word tokens from the final layer of the text encoder, where $m$ is the length of the sentence.
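To make the representation split explicit, the sketch below pulls the sentence-level ([EOS]) feature and the word-level features out of the text encoder's last-layer token features; the input layout and argument names are illustrative assumptions.

```python
import torch

def split_text_features(token_feats: torch.Tensor, eos_positions: torch.Tensor):
    """Split last-layer text-encoder outputs into sentence- and word-level features.

    token_feats:   (B, L, D) token features from the final layer of the text encoder.
    eos_positions: (B,) index of the [EOS] token in each padded sequence.
    """
    batch_idx = torch.arange(token_feats.size(0), device=token_feats.device)
    sentence_feats = token_feats[batch_idx, eos_positions]  # coarse-grained, (B, D)
    word_feats = token_feats                                 # fine-grained, (B, L, D)
    return sentence_feats, word_feats
```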

3.2. Multi-Grained Contrastive Learning

Previous VTR works (Luo et al., 2021; Lee et al., 2018) focus on fine-grained and coarse-grained contrastive learning, which include video-sentence and frame-word contrasts. However, as explained in Sec. 1, cross-grained (i.e., video-word and sentence-frame) contrast can explicitly filter out the unnecessary information in the video and sentence. Therefore, different from previous works (Luo et al., 2021; Lee et al., 2018; Yao et al., 2021), which focus only on a single granularity of contrast, X-CLIP is a multi-grained contrastive framework for VTR.

3.2.1. Video-Sentence Contrast

Given the video-level representation $\bar{v}$ and the sentence-level representation $\bar{t}$, we use matrix multiplication to evaluate the similarity between the video and the sentence, which can be formulated as:

$S_{V\text{-}S} = \bar{v}^{\top}\bar{t}, \qquad (3)$

where $S_{V\text{-}S}$ is the video-sentence similarity score. (For clarity and simplicity, we omit the frame (word) index and video (sentence) index of the visual (textual) representations.)

3.2.2. Video-Word Contrast

Given the video-level representation $\bar{v}$ and the word-level representations $\tilde{T}$, we use matrix multiplication to calculate the similarity between the video representation and each word representation, which can be represented as follows:

$S_{V\text{-}W} = \left[\bar{v}^{\top}\tilde{t}_1, \bar{v}^{\top}\tilde{t}_2, \dots, \bar{v}^{\top}\tilde{t}_m\right] \in \mathbb{R}^{m}, \qquad (4)$

where $S_{V\text{-}W}$ is the similarity vector between the video and each word in the sentence, and $m$ is the length of the sentence.

3.2.3. Sentence-Frame Contrast

Similar to Video-Word Contrast, we can calculate the similarity between the sentence representation and each frame representation by matrix multiplication, which can be formulated as follows:

$S_{S\text{-}F} = \left[\bar{t}^{\top}\tilde{v}_1, \bar{t}^{\top}\tilde{v}_2, \dots, \bar{t}^{\top}\tilde{v}_n\right] \in \mathbb{R}^{n}, \qquad (5)$

where $S_{S\text{-}F}$ is the similarity vector between the sentence and each frame of the video, and $n$ is the number of frames in the video.

3.2.4. Frame-Word Contrast

The fine-grained similarity matrix between the word representations and the frame representations can also be obtained by matrix multiplication:

$S_{F\text{-}W} = \tilde{V}\tilde{T}^{\top} \in \mathbb{R}^{n \times m}, \quad \text{i.e.,}\ (S_{F\text{-}W})_{i,j} = \tilde{v}_i^{\top}\tilde{t}_j, \qquad (6)$

where $S_{F\text{-}W}$ is the fine-grained similarity matrix, and $n$ and $m$ are the number of frames and words, respectively.
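The sketch below computes the four similarity granularities of Eqs. (3)-(6) for a single video-sentence pair. L2-normalizing the features before the dot products follows common CLIP-style practice and is an assumption here rather than something stated in this section.

```python
import torch
import torch.nn.functional as F

def multi_grained_similarities(video_feat, frame_feats, sent_feat, word_feats):
    """Compute the four similarity granularities for one video-sentence pair.

    video_feat:  (D,)   coarse-grained visual feature
    frame_feats: (n, D) fine-grained visual features
    sent_feat:   (D,)   coarse-grained textual feature
    word_feats:  (m, D) fine-grained textual features
    """
    # L2 normalization (assumed, as in CLIP-style models).
    video_feat = F.normalize(video_feat, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)
    sent_feat = F.normalize(sent_feat, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    s_vs = video_feat @ sent_feat        # Eq. (3): scalar video-sentence score
    s_vw = word_feats @ video_feat       # Eq. (4): (m,) video-word vector
    s_sf = frame_feats @ sent_feat       # Eq. (5): (n,) sentence-frame vector
    s_fw = frame_feats @ word_feats.T    # Eq. (6): (n, m) frame-word matrix
    return s_vs, s_vw, s_sf, s_fw
```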

3.3. Attention Over Similarity Matrix (AOSM)

To obtain the instance-level similarity, we fuse the similarity vectors/matrix from Eq. 4, Eq. 5 and Eq. 6. As discussed in Sec. 1, Mean-Max strategies (Yao et al., 2021; Khattab and Zaharia, 2020; Santhanam et al., 2021; Khattab et al., 2021) ignore the importance of different frames and words. To address this issue, we propose the Attention Over Similarity Matrix (AOSM) module, in which the scores in the similarity vectors/matrices are given different weights during aggregation.

Specifically, given the similarity vectors $S_{V\text{-}W}$ and $S_{S\text{-}F}$, we first use Softmax to obtain the weights for each similarity vector, where the scores of the fine-grained features related to the query are given high weights. Then, we aggregate these similarity scores based on the obtained weights, which can be formulated as follows:

$S'_{V\text{-}W} = \sum_{j=1}^{m}\frac{\exp(S_{V\text{-}W,j}/\tau)}{\sum_{k=1}^{m}\exp(S_{V\text{-}W,k}/\tau)}\,S_{V\text{-}W,j}, \qquad (7)$

$S'_{S\text{-}F} = \sum_{i=1}^{n}\frac{\exp(S_{S\text{-}F,i}/\tau)}{\sum_{k=1}^{n}\exp(S_{S\text{-}F,k}/\tau)}\,S_{S\text{-}F,i}, \qquad (8)$

where $\tau$ is the temperature parameter of the Softmax.

Since the fine-grained similarity matrix $S_{F\text{-}W}$ contains the similarity scores of all frames and words, we perform the attention operation on the matrix twice. The first attention aims to obtain fine-grained video-level and sentence-level similarity vectors, which can be formulated as follows:

$S'_{vid,j} = \sum_{i=1}^{n}\frac{\exp((S_{F\text{-}W})_{i,j}/\tau)}{\sum_{k=1}^{n}\exp((S_{F\text{-}W})_{k,j}/\tau)}\,(S_{F\text{-}W})_{i,j}, \qquad (9)$

$S'_{sen,i} = \sum_{j=1}^{m}\frac{\exp((S_{F\text{-}W})_{i,j}/\tau)}{\sum_{k=1}^{m}\exp((S_{F\text{-}W})_{i,k}/\tau)}\,(S_{F\text{-}W})_{i,j}, \qquad (10)$

where $S'_{vid} \in \mathbb{R}^{m}$ and $S'_{sen} \in \mathbb{R}^{n}$ are the video-level and sentence-level similarity vectors, respectively. Specifically, $S'_{vid}$ contains the similarity score between the video and each word in the sentence, and $S'_{sen}$ contains the similarity score between the sentence and each frame in the video.

To obtain the fine-grained instance-level similarity score, we conduct a second attention operation on the video-level similarity vector $S'_{vid}$ and the sentence-level similarity vector $S'_{sen}$, which can be represented as follows:

$S''_{vid} = \sum_{j=1}^{m}\frac{\exp(S'_{vid,j}/\tau)}{\sum_{k=1}^{m}\exp(S'_{vid,k}/\tau)}\,S'_{vid,j}, \qquad (11)$

$S''_{sen} = \sum_{i=1}^{n}\frac{\exp(S'_{sen,i}/\tau)}{\sum_{k=1}^{n}\exp(S'_{sen,k}/\tau)}\,S'_{sen,i}, \qquad (12)$

where $S''_{vid}$ and $S''_{sen}$ are the instance-level similarities. We use their average value as the fine-grained similarity score:

$S'_{F\text{-}W} = \frac{1}{2}\left(S''_{vid} + S''_{sen}\right). \qquad (13)$
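A minimal sketch of the AOSM aggregation in Eqs. (7)-(13), assuming the row/column conventions written above (rows index frames, columns index words); the variable names are illustrative.

```python
import torch

def aosm_aggregate(s_vw, s_sf, s_fw, tau: float = 0.01):
    """Attention Over Similarity Matrix: softmax-weighted aggregation.

    s_vw: (m,)   video-word similarity vector
    s_sf: (n,)   sentence-frame similarity vector
    s_fw: (n, m) frame-word similarity matrix
    tau:  softmax temperature (0.01 is the best value reported in Tab. 9).
    """
    def attend(vec, dim=-1):
        # Weight each score by its softmax probability, then sum (Eqs. 7-8).
        return (torch.softmax(vec / tau, dim=dim) * vec).sum(dim)

    s_vw_agg = attend(s_vw)   # cross-grained video-word score
    s_sf_agg = attend(s_sf)   # cross-grained sentence-frame score

    # First attention over the fine-grained matrix (Eqs. 9-10):
    vid_vec = (torch.softmax(s_fw / tau, dim=0) * s_fw).sum(0)   # (m,) per-word
    sen_vec = (torch.softmax(s_fw / tau, dim=1) * s_fw).sum(1)   # (n,) per-frame
    # Second attention to instance-level scores, then average (Eqs. 11-13):
    s_fw_agg = 0.5 * (attend(vid_vec) + attend(sen_vec))
    return s_vw_agg, s_sf_agg, s_fw_agg
```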
Text-to-Video Retrieval Video-to-Text Retrieval
Model R@1↑ R@5↑ R@10↑ MdR↓ MnR↓ R@1↑ R@5↑ R@10↑ MdR↓ MnR↓
CE (Liu et al., 2019) 20.9 48.8 62.4 6.0 28.2 20.6 50.3 64.0 5.3 -
MMT (Gabeur et al., 2020) 26.6 57.1 69.6 4.0 24.0 27.0 57.5 69.7 3.7 -
AVLnet (Rouditchenko et al., 2020) 27.1 55.6 66.6 4.0 - 28.5 54.6 65.2 4.0 -
SSB (Patrick et al., 2020) 30.1 58.5 69.3 3.0 - 28.5 58.6 71.6 3.0 -
MDMMT (Dzabraev et al., 2021) 38.9 69.0 79.7 2.0 16.5 - - - - -
Frozen (Bain et al., 2021) 31.0 59.5 70.5 3.0 - - - - - -
HiT (Liu et al., 2021) 30.7 60.9 73.2 2.6 - 32.1 62.7 74.1 3.0 -
TT-CE+ (Croitoru et al., 2021) 29.6 61.6 74.2 3.0 - 32.1 62.7 75.0 3.0 -
CLIP-straight (Portillo-Quintero et al., 2021) 31.2 53.7 64.2 4.0 - 27.2 51.7 62.6 5.0 -
CLIP4Clip-MeanP (ViT-B/32) (Luo et al., 2021) 43.1 70.4 80.8 2.0 16.2 43.1 70.5 81.2 2.0 12.4
CLIP4Clip-seqLSTM (ViT-B/32) (Luo et al., 2021) 42.5 70.8 80.7 2.0 16.7 42.8 71.0 80.4 2.0 12.3
CLIP4Clip-seqTransf (ViT-B/32) (Luo et al., 2021) 44.5 71.4 81.6 2.0 15.3 42.7 70.9 80.6 2.0 11.6
CLIP4Clip-tightTransf (ViT-B/32) (Luo et al., 2021) 40.2 71.5 80.5 2.0 13.4 40.6 69.5 79.5 2.0 13.6
CLIP4Clip-MeanP (ViT-B/16) (Luo et al., 2021) 45.3 73.3 83.0 2.0 13.0 44.8 73.2 82.2 2.0 9.6
CLIP4Clip-seqLSTM (ViT-B/16) (Luo et al., 2021) 44.3 72.0 82.2 2.0 13.7 44.3 73.4 82.4 2.0 10.3
CLIP4Clip-seqTransf (ViT-B/16) (Luo et al., 2021) 46.4 72.1 82.0 2.0 14.7 45.4 73.4 82.4 2.0 10.7
CLIP4Clip-tightTransf (ViT-B/16) (Luo et al., 2021) 42.9 71.7 81.5 2.0 13.3 41.9 71.0 80.7 2.0 10.1
X-CLIP (ViT-B/32) 46.1 73.0 83.1 2.0 13.2 46.8 73.3 84.0 2.0 9.1
X-CLIP (ViT-B/16) 49.3 75.8 84.8 2.0 12.2 48.9 76.8 84.5 2.0 8.1
Table 1. Retrieval performance comparison to SOTAs on the MSR-VTT dataset.

3.4. Similarity Calculation

The similarity score measures the semantic similarity between two instances. Different from the previous work (Luo et al., 2021), which only considers coarse-grained contrast, our proposed X-CLIP adopts multi-grained contrast during retrieval. Therefore, the final similarity score of X-CLIP combines the multi-grained contrastive similarity scores, which can be represented as follows:

$S(v, t) = \frac{1}{4}\left(S_{V\text{-}S} + S'_{V\text{-}W} + S'_{S\text{-}F} + S'_{F\text{-}W}\right). \qquad (14)$

3.5. Objective Function

During training, given a batch of $B$ video-text pairs, the model generates a $B \times B$ similarity matrix. We adopt the symmetric InfoNCE loss over this similarity matrix to optimize the retrieval model, which can be formulated as:

$\mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(S(v_i, t_i))}{\sum_{j=1}^{B}\exp(S(v_i, t_j))}, \qquad (15)$

$\mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(S(v_i, t_i))}{\sum_{j=1}^{B}\exp(S(v_j, t_i))}, \qquad (16)$

$\mathcal{L} = \mathcal{L}_{v2t} + \mathcal{L}_{t2v}. \qquad (17)$
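A minimal sketch of the symmetric InfoNCE objective in Eqs. (15)-(17), written as two cross-entropy terms over a batch similarity matrix; the learnable logit scale used by CLIP-style models is omitted here and left as an assumption.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(sim_matrix: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over a (B, B) batch similarity matrix.

    sim_matrix[i, j] is the multi-grained similarity between video i and text j;
    matched pairs lie on the diagonal.
    """
    labels = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_v2t = F.cross_entropy(sim_matrix, labels)      # video-to-text direction
    loss_t2v = F.cross_entropy(sim_matrix.t(), labels)  # text-to-video direction
    return loss_v2t + loss_t2v
```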

4. Experiments

4.1. Datasets

MSR-VTT (Xu et al., 2016) is a popular video-text retrieval dataset, which contains 10,000 videos and 200,000 captions. The length of videos in this dataset ranges from 10 to 32 seconds. In this paper, we adopt the widely-used ‘Training-9K’ split, where 9,000 videos and 180,000 captions are used for training and the rest are used for testing.

MSVD (Chen and Dolan, 2011) contains 1,970 videos, whose durations vary from 1 to 62 seconds. Each video is annotated with 40 English captions. We use 1,200, 100, and 670 videos for training, validation, and testing, respectively.

LSMDC (Rohrbach et al., 2015) is a dataset that contains 118,081 videos and captions. The duration of each video ranges from 2 to 30 seconds. We adopt 109,673, 7,408, and 1,000 videos for training, validation, and testing, respectively.

DiDeMo (Anne Hendricks et al., 2017) contains 10,000 videos and 40,000 captions. Following previous works (Liu et al., 2019; Lei et al., 2021; Bain et al., 2021), all captions of a video are concatenated together during video-paragraph retrieval.

ActivityNet (Caba Heilbron et al., 2015) contains 20,000 YouTube videos, which are annotated temporally. Following previous works (Luo et al., 2021; Sun et al., 2019a; Gabeur et al., 2020), all captions of a video are also concatenated together during video-paragraph retrieval for fair comparison.

4.2. Experimental Settings

4.2.1. Implementation Details

We conduct the experiments on 4 NVIDIA Tesla V100 32GB GPUs using the PyTorch library. Following the previous work (Luo et al., 2021), the text encoder and frame encoder of X-CLIP are initialized with the public CLIP checkpoints. We use the Adam optimizer (Kingma and Ba, 2015) to optimize X-CLIP and decay the learning rate with a cosine schedule (Loshchilov and Hutter, 2016). Since the parameters of the text encoder and frame encoder are initialized from the public CLIP checkpoints, we adopt different learning rates for different modules. Specifically, the initial learning rate for the text encoder and frame encoder is 1e-7, and the initial learning rate for the other modules is 1e-4. We set the max token length, max frame length, batch size, and number of training epochs to 32, 12, 300, and 3 for the MSR-VTT, MSVD, and LSMDC datasets. Since videos and captions in DiDeMo and ActivityNet are longer and more complex, we set the max token length, max frame length, and number of training epochs to 64, 64, and 20 for these datasets. Due to the limitation of GPU memory, we also reduce the batch size for DiDeMo and ActivityNet to 64. We conduct the ablation, quantitative, and qualitative experiments on the MSR-VTT dataset, since it is more popular and competitive than the other datasets. The base model of X-CLIP is ViT-B/32 unless otherwise specified.
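The sketch below illustrates the two-learning-rate setup and the cosine decay described above. The module name prefixes (`text_encoder`, `frame_encoder`) and `steps_per_epoch` are illustrative assumptions that depend on how the model is actually defined.

```python
import torch

def build_optimizer(model, base_lr=1e-4, clip_lr=1e-7, epochs=3, steps_per_epoch=1000):
    """Small learning rate for CLIP-initialized encoders, larger for new modules."""
    clip_params, new_params = [], []
    for name, p in model.named_parameters():
        if name.startswith(("text_encoder", "frame_encoder")):
            clip_params.append(p)      # pretrained CLIP weights -> lr 1e-7
        else:
            new_params.append(p)       # newly added modules -> lr 1e-4
    optimizer = torch.optim.Adam([
        {"params": clip_params, "lr": clip_lr},
        {"params": new_params, "lr": base_lr},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)   # cosine decay over training
    return optimizer, scheduler
```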

4.2.2. Evaluation Protocols

To evaluate the retrieval performance of our proposed model, we use recall at Rank K (R@K, higher is better), median rank (MdR, lower is better), and mean rank (MnR, lower is better) as retrieval metrics, which are widely used in previous retrieval works (Yu et al., 2018; Zhu and Yang, 2020; Lei et al., 2021; Liu et al., 2019; Gabeur et al., 2020; Dzabraev et al., 2021; Mithun et al., 2018; Zhang et al., 2018; Liu et al., 2021; Dong et al., 2019; Bertasius et al., 2021; Arnab et al., 2021; Luo et al., 2021).
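For reference, a small sketch of how R@K, MdR, and MnR can be computed from a query-by-candidate similarity matrix, assuming the ground-truth candidate of query i sits at index i (as in the standard evaluation splits).

```python
import numpy as np

def retrieval_metrics(sim_matrix: np.ndarray) -> dict:
    """Compute R@K, median rank and mean rank from a (queries, candidates) matrix."""
    ranks = []
    for i, row in enumerate(sim_matrix):
        order = np.argsort(-row)                            # descending similarity
        ranks.append(int(np.where(order == i)[0][0]) + 1)   # 1-based rank of match
    ranks = np.asarray(ranks)
    return {
        "R@1": float((ranks <= 1).mean() * 100),
        "R@5": float((ranks <= 5).mean() * 100),
        "R@10": float((ranks <= 10).mean() * 100),
        "MdR": float(np.median(ranks)),
        "MnR": float(ranks.mean()),
    }
```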

4.3. Performance Comparison

We compare X-CLIP against previous works on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet. X-CLIP achieves SOTA results on all five datasets with significant improvements.

For the MSR-VTT dataset, the performance comparison is shown in Tab. 1. By analyzing the table, we gain the following observations:


  • Benefiting from the large-scale image-text pre-training, both CLIP4Clip and our model X-CLIP can obtain significant gains in performance compared with all the baselines. The consistent improvements verify that it is important to adopt end-to-end finetuning to realize the full potential of the image-text pre-trained model on video-text retrieval.

  • Compared with the strongest competitor (i.e., CLIP4Clip-seqTransf), X-CLIP obtains 49.3 R@1 (6.3% relative improvement, 2.9% absolute improvement) in the text-to-video retrieval task and 48.9 R@1 (7.7% relative improvement, 3.5% absolute improvement) in the video-to-text retrieval task when employing CLIP (ViT-B/16) as the pre-trained model. This can be attributed to the fact that our proposed cross-grained contrast and AOSM module are critical to reducing the adverse effects of unnecessary frames and unimportant words.

  • Compared to all the other state-of-the-art methods, our model with ViT-B/16 achieves the best performance on all metrics. Surprisingly, our model with ViT-B/32 can even achieve performance comparable to CLIP4Clip with ViT-B/16, which again demonstrates the effectiveness and superiority of multi-grained contrast and the AOSM module.

Text-to-Video Video-to-Text
Model R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
Multi Cues (Mithun et al., 2018) 20.3 47.8 - - - -
CE (Liu et al., 2019) 19.8 49.0 - - - -
SSB (Patrick et al., 2020) 28.4 60.0 - - - -
NoiseE (Amrani et al., 2020) 20.3 49.0 - - - -
CLIP-straight (Portillo-Quintero et al., 2021) 37.0 64.1 - 59.9 85.2 -
Frozen (Bain et al., 2021) 33.7 64.7 - - - -
TT-CE+ (Croitoru et al., 2021) 25.4 56.9 - 27.1 55.3 -
CLIP4Clip-MeanP (ViT-B/32) (Luo et al., 2021) 46.2 76.1 10.0 56.6 79.7 7.6
CLIP4Clip-seqTransf (ViT-B/32) (Luo et al., 2021) 45.2 75.5 10.3 62.0 87.3 4.3
CLIP4Clip-MeanP (ViT-B/16) (Luo et al., 2021) 47.3 77.7 9.1 62.9 87.2 4.2
CLIP4Clip-seqTransf (ViT-B/16) (Luo et al., 2021) 47.2 77.7 9.1 63.2 87.2 4.2
X-CLIP (ViT-B/32) 47.1 77.8 9.5 60.9 87.8 4.7
X-CLIP (ViT-B/16) 50.4 80.6 8.4 66.8 90.4 4.2
Table 2. Retrieval performance comparison on MSVD.
Text-to-Video Video-to-Text
Model R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
CT-SAN (Yu et al., 2017) 5.1 16.3 - - - -
JSFusion (Yu et al., 2018) 9.1 21.2 - 12.3 28.6 -
CE (Liu et al., 2019) 11.2 26.9 96.8 - - -
MMT (Gabeur et al., 2020) 12.9 29.9 75.0 - - -
NoiseE (Amrani et al., 2020) 6.4 19.8 - - - -
CLIP-straight (Portillo-Quintero et al., 2021) 11.3 22.7 - 6.8 16.4 -
MDMMT (Dzabraev et al., 2021) 18.8 38.5 58.0 - - -
Frozen (Bain et al., 2021) 15.0 30.8 - - - -
HiT (Liu et al., 2021) 14.0 31.2 - - - -
TT-CE+ (Croitoru et al., 2021) 17.2 36.5 - 17.5 36.0 -
CLIP4Clip-MeanP (ViT-B/32) (Luo et al., 2021) 20.7 38.9 65.3 20.6 39.4 56.7
CLIP4Clip-seqTransf (ViT-B/32) (Luo et al., 2021) 22.6 41.0 61.0 20.8 39.0 54.2
CLIP4Clip-MeanP (ViT-B/16) (Luo et al., 2021) 23.5 43.2 54.8 22.6 50.5 50.3
CLIP4Clip-seqTransf (ViT-B/16) (Luo et al., 2021) 23.5 45.2 51.6 23.2 42.4 47.4
X-CLIP (ViT-B/32) 23.3 43.0 56.0 22.5 42.2 50.7
X-CLIP (ViT-B/16) 26.1 48.4 46.7 26.9 46.2 41.9
Table 3. Retrieval performance comparison on LSMDC.

We further validate the generalization of X-CLIP on MSVD, LSMDC, DiDeMo and ActivityNet in Tab. 2 - 5. It is worth noting that, among all variants of CLIP4Clip, we only report the performance of CLIP4Clip-MeanP and CLIP4Clip-seqTransf, because they perform better than the other two variants according to the findings of the previous work (Luo et al., 2021) and the performance comparison in Tab. 1. By analyzing these tables, we observe that X-CLIP also achieves significant improvements on these datasets for the text-to-video and video-to-text retrieval tasks. Specifically, for the text-to-video retrieval task, X-CLIP outperforms CLIP4Clip with ViT-B/16 on R@1 by +6.6% (+3.1%), +11.1% (+2.6%), +6.7% (+3.0%), and +3.8% (+1.7%) relative (absolute) improvement on the aforementioned four datasets, respectively. For the video-to-text retrieval task, X-CLIP obtains +5.7% (+3.6%), +12.9% (+3.0%), +1.3% (+0.6%), and +5.2% (+2.3%) relative (absolute) improvement on R@1. This demonstrates that our proposed X-CLIP achieves consistent performance improvements on several video-text retrieval datasets. More experimental results are provided in the supplementary materials.

Text-to-Video Video-to-Text
Model R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
S2VT (Venugopalan et al., 2014) 11.9 33.6 - 13.2 33.6 -
FSE (Zhang et al., 2018) 13.9 36.0 - 13.1 33.9 -
CE (Liu et al., 2019) 16.1 41.1 43.7 15.6 40.9 42.4
ClipBERT (Lei et al., 2021) 20.4 48.0 - - - -
Frozen (Bain et al., 2021) 34.6 65.0 - - - -
TT-CE+ (Croitoru et al., 2021) 21.6 48.6 - 21.1 47.3 -
CLIP4Clip-MeanP (ViT-B/32) (Luo et al., 2021) 43.4 70.2 17.5 42.5 70.6 11.6
CLIP4Clip-seqTransf (ViT-B/32) (Luo et al., 2021) 42.8 68.5 18.9 41.4 68.2 12.4
CLIP4Clip-MeanP (ViT-B/16) (Luo et al., 2021) 44.8 75.1 13.0 47.2 74.0 10.5
CLIP4Clip-seqTransf (ViT-B/16) (Luo et al., 2021) 44.8 73.4 13.5 44.7 74.0 10.6
X-CLIP (ViT-B/32) 45.2 74.0 14.6 43.1 72.2 10.9
X-CLIP (ViT-B/16) 47.8 79.3 12.6 47.8 76.8 10.5
Table 4. Retrieval performance comparison on DiDeMo.
Text-to-Video Video-to-Text
Model R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
FSE (Zhang et al., 2018) 18.2 44.8 - 16.7 43.1 -
CE (Liu et al., 2019) 18.2 47.7 23.1 17.7 46.6 24.4
HSE (Zhang et al., 2018) 20.5 49.3 - 18.7 48.1 -
MMT (Gabeur et al., 2020) 28.7 61.4 16.0 28.9 61.1 17.1
SSB (Patrick et al., 2020) 29.2 61.6 - 28.7 60.8 -
HiT (Liu et al., 2021) 29.6 60.7 - - - -
ClipBERT (Lei et al., 2021) 21.3 49.0 - - - -
TT-CE+ (Croitoru et al., 2021) 23.5 57.2 - 23.0 56.1 -
CLIP4Clip-MeanP (ViT-B/32) (Luo et al., 2021) 40.5 72.4 7.4 42.5 74.1 6.6
CLIP4Clip-seqTransf (ViT-B/32) (Luo et al., 2021) 40.5 72.4 7.5 41.4 73.7 6.7
CLIP4Clip-MeanP (ViT-B/16) (Luo et al., 2021) 44.0 73.9 7.0 44.1 74.0 6.5
CLIP4Clip-seqTransf (ViT-B/16) (Luo et al., 2021) 44.5 75.2 6.4 44.1 75.2 6.4
X-CLIP (ViT-B/32) 44.3 74.1 7.9 43.9 73.9 7.6
X-CLIP (ViT-B/16) 46.2 75.5 6.8 46.4 75.9 6.4
Table 5. Retrieval performance comparison on ActivityNet.

4.4. Ablation Study

To fully examine the impact of different contrastive modules, we conduct an ablation study to compare different variants of X-CLIP. As shown in Tab. 6, we make two important observations:


  • As the number of contrastive modules increases, the retrieval performance tends to improve. When X-CLIP is equipped with all contrastive modules, the best retrieval performance is achieved. This may be because each contrastive module plays a different role in the retrieval task and different contrastive modules can promote each other to achieve better retrieval results.

  • Our proposed cross-grained contrast can assist fine-grained or coarse-grained contrast to achieve better retrieval performance. Specifically, X-CLIP with only the sentence-video contrast module (i.e., Exp1) achieves 43.0 R@1 in the text-to-video retrieval task. However, when X-CLIP is additionally equipped with cross-grained contrast modules (i.e., Exp8 and Exp9), the performance obtains clear absolute improvements of 2.4% and 1.0%, respectively. Similarly, when X-CLIP is equipped with only the fine-grained and coarse-grained contrast modules (i.e., Exp10), it achieves 44.8 R@1 in the text-to-video task. However, when it is additionally equipped with cross-grained contrast modules (i.e., Exp13 and Exp14), absolute improvements of 1.0% and 0.7% in R@1 are achieved. Therefore, we conclude that the performance improvement from the cross-grained contrast modules does not conflict with that of the coarse-grained and fine-grained contrast modules.

Contrastive Module Text-to-Video Video-to-Text
ID Sent-Video Sent-Frame Word-Video Word-Frame R@1↑ R@5↑ R@10↑ MnR↓ R@1↑ R@5↑ R@10↑ MnR↓
Exp1 43.0 70.7 81.6 16.3 43.0 70.2 81.2 11.5
Exp2 42.7 69.6 81.3 13.9 43.1 70.7 82.1 9.9
Exp3 42.8 69.9 80.1 17.0 43.2 70.1 80.5 13.8
Exp4 42.7 69.5 81.3 14.4 42.8 70.8 81.7 10.6
Exp5 44.6 72.8 82.4 13.9 45.7 73.2 82.3 9.1
Exp6 45.6 72.0 82.0 13.6 44.8 72.5 81.7 9.6
Exp7 44.1 70.2 81.3 14.3 44.4 71.6 82.8 9.7
Exp8 45.4 72.2 81.6 13.4 45.4 72.8 82.7 9.2
Exp9 44.0 70.3 82.5 13.9 43.6 70.9 81.8 11.3
Exp10 44.8 72.6 83.0 13.6 45.3 73.0 83.8 9.5
Exp11 45.7 72.7 82.5 13.2 45.6 72.8 82.9 9.2
Exp12 45.7 72.7 82.5 13.2 45.6 72.8 82.9 9.2
Exp13 45.8 73.2 82.7 13.2 46.5 72.6 83.8 9.7
Exp14 45.5 72.8 82.9 13.5 46.4 72.5 83.7 9.6
Exp15 46.1 73.0 83.1 13.2 46.8 73.3 84.0 9.1
Table 6. Retrieval performance with different contrastive granularity on the MSR-VTT dataset.

To justify the effectiveness of the proposed AOSM module, we compare our method with the conventional Mean-Max strategy and other variants (i.e., Max-Max, Max-Mean and Mean-Mean). As shown in Tab. 7, we observe that the Mean-Mean strategy performs worst. This may be because the Mean-Mean strategy, which applies the same weight to all similarity scores during aggregation, cannot eliminate the adverse effects of unnecessary frames and unimportant words on the retrieval results. The Max-Mean, Mean-Max and Max-Max strategies perform better than the Mean-Mean strategy. This can be attributed to the fact that these strategies adopt the highest similarity during aggregation, so contrast scores between unnecessary frames and unimportant words are filtered out. However, since these strategies adopt only the top-1 similarity score, some important similarity scores are also ignored. To address this issue, we propose the AOSM module, where all similarity scores are assigned different weights during aggregation. From Tab. 7, we observe that our proposed attention mechanism achieves better performance than the other strategies.

Text-to-Video Video-to-Text
Method R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
Max-Max 44.0 72.6 13.5 44.4 72.5 9.2
Mean-Mean 43.2 71.2 14.8 42.5 70.2 11.4
Mean-Max 44.4 71.1 14.9 44.2 71.7 10.2
Max-Mean 44.9 71.3 13.5 43.8 71.8 9.4
Attention 46.1 73.0 13.2 46.8 73.3 9.1
Table 7. Retrieval performance with different fusion methods for similarity matrices on the MSR-VTT dataset.
Figure 3. Top-3 video-to-text retrieval results on MSR-VTT. The number in parentheses is the similarity score.
Text-to-Video Video-to-Text
Base Model TE R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
ViT-B/32 45.2 72.9 13.8 45.6 73.9 9.2
46.1 73.0 13.2 46.8 73.3 9.1
ViT-B/16 48.3 75.3 13.4 47.6 76.1 9.0
49.3 75.8 12.2 48.9 76.8 8.1
Table 8. Ablation study of temporal encoder on the MSR-VTT dataset. TE is short for temporal encoder.

To explore the impact of the temporal encoder module in X-CLIP, we also conduct an ablative study comparing X-CLIP with and without the temporal encoder. As shown in Tab. 8, based on either ViT-B/32 or ViT-B/16, X-CLIP with the temporal encoder consistently outperforms X-CLIP without it. This may be because the temporal encoder models the temporal relations among different frames in a video. Therefore, X-CLIP without the temporal encoder cannot understand and perceive information that requires a combination of multiple frames, e.g., actions. Based on the above analysis, we conclude that temporal modeling is also a key to improving the performance of retrieval tasks.

4.5. Effect of Temperature Parameter

Text-to-Video Video-to-Text
τ R@1↑ R@5↑ MnR↓ R@1↑ R@5↑ MnR↓
1 43.9 71.6 14.5 43.5 71.3 11.3
0.1 45.2 72.2 14.0 45.3 73.1 9.3
0.01 46.1 73.0 13.2 46.8 73.3 9.1
0.001 45.6 72.2 13.7 43.6 72.5 9.4
Table 9. Retrieval performance with different temperature parameters τ in Softmax on the MSR-VTT dataset.

To explore the effect of different $\tau$ values in the AOSM module, we also design a group of experiments with different temperature parameters in the Softmax. From Tab. 9, we observe that the retrieval performance first improves before reaching the saturation point (i.e., $\tau = 0.01$), and then begins to decline slightly. The main reason may be that when $\tau$ is large, too many noisy similarity scores are considered. On the contrary, when $\tau$ is small, some important similarity scores may be ignored. Besides, our proposed attention mechanism with different $\tau$ values consistently performs better than the Mean-Mean strategy, and the attention mechanism with the optimal $\tau$ outperforms the other strategies on all evaluation protocols. This justifies that our proposed attention mechanism helps to strengthen the influence of important similarity scores and weaken the influence of noisy similarity scores, thus achieving better retrieval performance.

4.6. Qualitative Analysis

Figure 4. Top-3 text-to-video retrieval results on MSR-VTT. The number in parentheses is the similarity score.

To qualitatively validate the effectiveness of our proposed X-CLIP, we show some typical video-to-text and text-to-video retrieval examples in Fig. 3 and Fig. 4, respectively. From these retrieval results, we find that X-CLIP accurately understands the content of sentences and videos. Meanwhile, X-CLIP is robust in comprehending complex and similar sentences and videos, which is mainly attributed to the multi-grained contrast of our proposed model. To be specific, as shown in the first example in Fig. 3, although the top-3 retrieved sentences are similar, our proposed X-CLIP can still choose the correct sentence by understanding the details of the sentences and videos. Similarly, as shown in the first example in Fig. 4, all top-3 retrieved videos describe the same cartoon, while the “squid” does not appear in the second and third videos. Due to the multi-grained contrast, X-CLIP performs well in visual and textual content understanding, so it can retrieve the correct video.

5. Conclusion

In this paper, we present X-CLIP, a novel end-to-end multi-grained contrastive model for video-text retrieval, which first encodes the sentences and videos into coarse-grained and fine-grained representations, and conducts fine-grained, coarse-grained, and cross-grained contrasts over these representations. The multi-grained contrast and the AOSM module of X-CLIP help to reduce the negative effects of unnecessary frames and unimportant words during retrieval. Significant performance gains on five popular video-text retrieval datasets demonstrate the effectiveness and superiority of our proposed model.

Acknowledgements.
This work was supported by the National Science Fund for Distinguished Young Scholars (No.62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), Guangdong Basic and Applied Basic Research Foundation (No.2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No.2021J01002). This work was supported by Alibaba Group through Alibaba Research Intern Program.

References

  • E. Amrani, R. Ben-Ari, D. Rotman, and A. Bronstein (2020) Noise estimation using density estimation for self-supervised multimodal learning. arXiv preprint arXiv:2003.03186. Cited by: Table 2, Table 3.
  • L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp. 5803–5812. Cited by: §1, §4.1.
  • A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid (2021) Vivit: a video vision transformer. arXiv preprint arXiv:2103.15691. Cited by: §1, §4.2.2.
  • M. Bain, A. Nagrani, G. Varol, and A. Zisserman (2021) Frozen in time: a joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738. Cited by: §1, §2.1, §2.2, Table 1, §4.1, Table 2, Table 3, Table 4.
  • G. Bertasius, H. Wang, and L. Torresani (2021) Is space-time attention all you need for video understanding?. arXiv preprint arXiv:2102.05095. Cited by: §1, §4.2.2.
  • F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961–970. Cited by: §1, §4.1.
  • D. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 190–200. Cited by: §1, §4.1.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §2.3.
  • X. Chen, H. Fan, R. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.3.
  • X. Chen, S. Xie, and K. He (2021) An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649. Cited by: §2.3.
  • I. Croitoru, S. Bogolin, M. Leordeanu, H. Jin, A. Zisserman, S. Albanie, and Y. Liu (2021) Teachtext: crossmodal generalized distillation for text-video retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11583–11593. Cited by: Table 1, Table 2, Table 3, Table 4, Table 5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1.
  • J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang (2019) Dual encoding for zero-example video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9346–9355. Cited by: §1, §4.2.2.
  • M. Dzabraev, M. Kalashnikov, S. Komkov, and A. Petiushko (2021) Mdmmt: multidomain multimodal transformer for video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3363. Cited by: §1, Table 1, §4.2.2, Table 3.
  • V. Gabeur, C. Sun, K. Alahari, and C. Schmid (2020) Multi-modal transformer for video retrieval. In European Conference on Computer Vision, pp. 214–229. Cited by: §1, §2.2, Table 1, §4.1, §4.2.2, Table 3, Table 5.
  • F. He, Q. Wang, Z. Feng, W. Jiang, Y. Lü, Y. Zhu, and X. Tan (2021) Improving video retrieval by adaptive margin. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1359–1368. Cited by: §2.2.
  • J. He, Y. Zhou, Q. Zhang, J. Peng, Y. Shen, X. Sun, C. Chen, and R. Ji (2022) PixelFolder: an efficient progressive pixel synthesis network for image generation. arXiv preprint arXiv:2204.00833. Cited by: §2.3.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §2.3.
  • Y. Huo, M. Zhang, G. Liu, H. Lu, Y. Gao, G. Yang, J. Wen, H. Zhang, B. Xu, W. Zheng, et al. (2021) WenLan: bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561. Cited by: §2.1.
  • Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) Tgif-qa: toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2758–2766. Cited by: §2.2.
  • J. Ji, Y. Luo, X. Sun, F. Chen, G. Luo, Y. Wu, Y. Gao, and R. Ji (2021) Improving image captioning by leveraging intra- and inter-layer global representation in transformer network. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35, pp. 1655–1663. Cited by: §2.3.
  • J. Ji, Y. Ma, X. Sun, Y. Zhou, Y. Wu, and R. Ji (2022) Knowing what to learn: a metric-oriented focal mechanism for image captioning. IEEE Transactions on Image Processing (), pp. 1–1. External Links: Document Cited by: §2.3.
  • C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. Cited by: §2.1.
  • W. Jin, Z. Zhao, P. Zhang, J. Zhu, X. He, and Y. Zhuang (2021) Hierarchical cross-modal graph consistency learning for video-text retrieval. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai (Eds.), pp. 1114–1124. External Links: Link, Document Cited by: §1.
  • O. Khattab, C. Potts, and M. Zaharia (2021) Relevance-guided supervision for openqa with colbert. Transactions of the Association for Computational Linguistics 9, pp. 929–944. Cited by: §1, §3.3.
  • O. Khattab and M. Zaharia (2020) Colbert: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48. Cited by: §1, §3.3.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR 2015. arXiv preprint arXiv:1412.6980. Cited by: §4.2.1.
  • T. M. Le, V. Le, S. Venkatesh, and T. Tran (2020) Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9972–9981. Cited by: §2.2.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216. Cited by: §1, §1, §3.2.
  • J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is more: clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7331–7341. Cited by: §1, §2.2, §4.1, §4.2.2, Table 4, Table 5.
  • J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
  • L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020a) Hero: hierarchical encoder for video+ language omni-representation pre-training. arXiv preprint arXiv:2005.00200. Cited by: §2.1.
  • X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020b) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137. Cited by: §2.1.
  • S. Liu, H. Fan, S. Qian, Y. Chen, W. Ding, and Z. Wang (2021) Hit: hierarchical transformer with momentum contrast for video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11915–11925. Cited by: §1, §2.2, Table 1, §4.2.2, Table 3, Table 5.
  • Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman (2019) Use what you have: video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487. Cited by: §1, Table 1, §4.1, §4.2.2, Table 2, Table 3, Table 4, Table 5.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.2.1.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32. Cited by: §2.1.
  • H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li (2021) Clip4clip: an empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860. Cited by: §1, §1, §1, §2.1, §2.2, §3.1.1, §3.2, §3.4, Table 1, §4.1, §4.2.1, §4.2.2, §4.3, Table 2, Table 3, Table 4, Table 5.
  • Y. Ma, J. Ji, X. Sun, Y. Zhou, Y. Wu, F. Huang, and R. Ji (2022) Knowing what it is: semantic-enhanced dual attention transformer. IEEE Transactions on Multimedia (), pp. 1–1. External Links: Document Cited by: §2.3.
  • A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: §2.2.
  • N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19–27. Cited by: §1, §4.2.2, Table 2.
  • M. Patrick, P. Huang, Y. Asano, F. Metze, A. Hauptmann, J. Henriques, and A. Vedaldi (2020) Support-set bottlenecks for video-text representation learning. arXiv preprint arXiv:2010.02824. Cited by: §2.2, Table 1, Table 2, Table 5.
  • J. A. Portillo-Quintero, J. C. Ortiz-Bayliss, and H. Terashima-Marín (2021) A straightforward framework for video retrieval using clip. In Mexican Conference on Pattern Recognition, pp. 3–12. Cited by: Table 1, Table 2, Table 3.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §1, §1, §1, §2.1, §2.3, §3.1.1, §3.1.3.
  • A. Rohrbach, M. Rohrbach, and B. Schiele (2015) The long-short story of movie description. In German conference on pattern recognition, pp. 209–221. Cited by: §1, §4.1.
  • A. Rouditchenko, A. Boggust, D. Harwath, B. Chen, D. Joshi, S. Thomas, K. Audhkhasi, H. Kuehne, R. Panda, R. Feris, et al. (2020) Avlnet: learning audio-visual language representations from instructional videos. arXiv preprint arXiv:2006.09199. Cited by: Table 1.
  • K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia (2021) Colbertv2: effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488. Cited by: §1, §3.3.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §3.1.3.
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019a) Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv:1906.05743. Cited by: §4.1.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019b) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473. Cited by: §2.1.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.1.1.
  • S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. Cited by: Table 4.
  • X. Wang, L. Zhu, and Y. Yang (2021) T2VLAD: global-local sequence alignment for text-video retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 5079–5088. External Links: Link Cited by: §1.
  • H. Xu, M. Yan, C. Li, B. Bi, S. Huang, W. Xiao, and F. Huang (2021) E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 503–513. External Links: Link, Document Cited by: §2.1.
  • J. Xu, T. Mei, T. Yao, and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296. Cited by: §1, §4.1.
  • J. Yang, Y. Bisk, and J. Gao (2021) Taco: token-aware cascade contrastive learning for video-text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11562–11572. Cited by: §1, §1, §2.3.
  • X. Yang, J. Dong, Y. Cao, X. Wang, M. Wang, and T. Chua (2020) Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, J. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, and Y. Liu (Eds.), pp. 1339–1348. External Links: Link, Document Cited by: §1.
  • L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu (2021) FILIP: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783. Cited by: §1, §1, §2.3, §3.2, §3.3.
  • F. Yu, J. Tang, W. Yin, Y. Sun, H. Tian, H. Wu, and H. Wang (2020) Ernie-vil: knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934. Cited by: §2.1.
  • Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 471–487. Cited by: §1, §4.2.2, Table 3.
  • Y. Yu, H. Ko, J. Choi, and G. Kim (2017) End-to-end concept word detection for video captioning, retrieval, and question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3165–3173. Cited by: §2.2, Table 3.
  • B. Zhang, H. Hu, and F. Sha (2018) Cross-modal and hierarchical modeling of video and text. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 374–390. Cited by: §1, §4.2.2, Table 4, Table 5.
  • X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji (2021) RSTNet: captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15465–15474. Cited by: §2.3.
  • C. Zhu, Y. Zhou, Y. Shen, G. Luo, X. Pan, M. Lin, C. Chen, L. Cao, X. Sun, and R. Ji (2022) SeqTR: a simple yet universal network for visual grounding. arXiv preprint arXiv:2203.16265. Cited by: §2.3.
  • L. Zhu and Y. Yang (2020) Actbert: learning global-local video-text representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8746–8755. Cited by: §1, §4.2.2.

6. Appendix

6.1. More Performance Comparison

To further verify the effectiveness of our method, we present a detailed comparison between our proposed X-CLIP and all variants of CLIP4Clip with different backbones (i.e., ViT-B/32 and ViT-B/16). As shown in Tab. 10 - Tab. 13, X-CLIP outperforms all CLIP4Clip variants. Notably, X-CLIP with the weaker backbone (ViT-B/32) even achieves performance comparable to CLIP4Clip with the stronger backbone (ViT-B/16). This may be because the proposed cross-grained contrast helps remove noisy information in videos and sentences while capturing the important content. This outstanding performance again demonstrates the importance and effectiveness of multi-grained contrast and the AOSM module.

Figure 5. Retrieval performance of models with different contrastive modules under different training-set sizes on the MSR-VTT dataset.
Figure 6. Overview of the X-CLIP variant that models multi-grained features with a Transformer instead of the AOSM module.

6.2. Effect of training dataset size on contrastive modules

To gain deeper insight into our four contrastive modules, we evaluate X-CLIP with a single contrastive module on training sets of different sizes. As illustrated in Fig. 5, when the training data is sufficient (i.e., 9k), the video-to-text and text-to-video retrieval performance of the four variants is similar. When the training set is reduced to 3k, performance differences between the variants begin to appear, and the word-frame contrastive module performs worse than the others. When the training set is further reduced to 0.1k, the other contrastive modules outperform the word-frame contrastive module by a significant margin. The main reason may be that the word-frame contrastive module is more complex than the others, so it is difficult to optimize on a small amount of training data.
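For concreteness, the sketch below (our illustration only, not the released X-CLIP code; the function name multi_grained_similarities and the tensor shapes are assumptions) shows how the four contrastive similarities discussed above, i.e., video-sentence, video-word, sentence-frame, and word-frame, could be computed from CLIP-style features.

```python
# A minimal sketch (illustration only, not the released X-CLIP code) of the four
# contrastive similarities: video-sentence, video-word, sentence-frame, word-frame.
import torch

def multi_grained_similarities(s, W, v, Fr):
    """s: (d,) sentence feature; W: (n_w, d) word features;
    v: (d,) video feature;      Fr: (n_f, d) frame features.
    All features are assumed to be L2-normalized, as in CLIP."""
    sim_vs = v @ s         # coarse-grained: a single video-sentence score
    sim_vw = W @ v         # cross-grained: each word vs. the whole video, (n_w,)
    sim_sf = Fr @ s        # cross-grained: each frame vs. the whole sentence, (n_f,)
    sim_wf = W @ Fr.t()    # fine-grained: word-frame similarity matrix, (n_w, n_f)
    return sim_vs, sim_vw, sim_sf, sim_wf

# Toy usage with random, L2-normalized features.
d, n_w, n_f = 512, 12, 8
s  = torch.nn.functional.normalize(torch.randn(d), dim=-1)
v  = torch.nn.functional.normalize(torch.randn(d), dim=-1)
W  = torch.nn.functional.normalize(torch.randn(n_w, d), dim=-1)
Fr = torch.nn.functional.normalize(torch.randn(n_f, d), dim=-1)
sims = multi_grained_similarities(s, W, v, Fr)
```

In this view, the word-frame module is the only one that produces a full similarity matrix rather than a single score or a vector, which is consistent with it being the hardest to optimize on small training sets.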

6.3. Effect of the AOSM module and Transformer modeling

To demonstrate the superiority and effectiveness of our proposed AOSM module, we also try using a Transformer to model the relationships among multi-grained features, which introduces additional computation and parameters. The architecture of this variant is shown in Fig. 6. As shown in Tab. 14, our proposed AOSM module performs better than the Transformer-based variant (Transf). The performance gain may come from two aspects:

1) The Transformer variant introduces many additional parameters, which makes it hard to optimize with a limited amount of training data. In contrast, AOSM explicitly calculates the importance of each frame and word, so noisy information in the video and sentence can be suppressed in X-CLIP. Moreover, AOSM contains far fewer parameters than the Transformer, making it easier to optimize.

2) The Transformer variant obtains similarity scores from a Linear layer, whereas our proposed AOSM obtains them via the dot product. Notably, the dot product is the conventional way of computing similarity in CLIP, while the newly added Linear layer carries no prior knowledge. Therefore, the Transformer variant benefits little from the pre-trained knowledge of CLIP, whereas X-CLIP retains this prior knowledge well.
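To make the contrast between the two designs concrete, the sketch below (an assumed form for illustration, not the released implementation; attention_over_similarity and tau are illustrative names) aggregates the word-frame similarity matrix with temperature-scaled softmax weights over plain dot-product scores, instead of feeding multi-grained features through a Transformer and predicting the score with a Linear layer.

```python
# A hedged sketch (assumed form, not the released implementation) of attention-style
# aggregation over the word-frame similarity matrix using plain dot-product scores,
# as opposed to a Transformer followed by a Linear scoring head.
import torch

def attention_over_similarity(sim_wf: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Collapse an (n_words, n_frames) similarity matrix into one instance-level
    score; softmax weights emphasize informative words/frames and down-weight noise."""
    # Aggregate over frames for each word, then over words (text-side view).
    per_word = (torch.softmax(sim_wf / tau, dim=-1) * sim_wf).sum(dim=-1)   # (n_words,)
    text_side = (torch.softmax(per_word / tau, dim=-1) * per_word).sum()
    # Symmetric aggregation starting from the frame axis (video-side view).
    per_frame = (torch.softmax(sim_wf / tau, dim=0) * sim_wf).sum(dim=0)    # (n_frames,)
    video_side = (torch.softmax(per_frame / tau, dim=-1) * per_frame).sum()
    return 0.5 * (text_side + video_side)
```

Because every attention weight is derived from the similarity scores themselves, this aggregation introduces no new scoring parameters, which matches the argument above about retaining CLIP's dot-product prior.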

Model Text-to-Video (R@1↑ R@5↑ MnR↓) Video-to-Text (R@1↑ R@5↑ MnR↓)
CLIP4Clip-MeanP (ViT-B/32) 46.2 76.1 10.0 56.6 79.7 7.6
CLIP4Clip-seqLSTM (ViT-B/32) 46.2 75.3 10.2 52.5 74.0 14.7
CLIP4Clip-seqTransf (ViT-B/32) 45.2 75.5 10.3 62.0 87.3 4.3
CLIP4Clip-tightTransf (ViT-B/32) 40.0 71.5 13.3 54.3 85.3 6.0
CLIP4Clip-MeanP (ViT-B/16) 47.3 77.7 9.1 62.9 87.2 4.2
CLIP4Clip-seqLSTM (ViT-B/16) 48.4 78.0 9.1 61.2 87.6 5.0
CLIP4Clip-seqTransf (ViT-B/16) 47.2 77.7 9.1 63.2 87.2 4.2
CLIP4Clip-tightTransf (ViT-B/16) 43.6 75.3 10.8 58.8 88.9 4.7
X-CLIP (ViT-B/32) 47.1 77.8 9.5 60.9 87.8 4.7
X-CLIP (ViT-B/16) 50.4 80.6 8.4 66.8 90.4 4.2
Table 10. Retrieval performance comparison on MSVD.
Model Text-to-Video (R@1↑ R@5↑ MnR↓) Video-to-Text (R@1↑ R@5↑ MnR↓)
CLIP4Clip-MeanP (ViT-B/32) 20.7 38.9 65.3 20.6 39.4 56.7
CLIP4Clip-seqLSTM (ViT-B/32) 21.6 41.8 58.0 20.9 40.7 53.9
CLIP4Clip-seqTransf (ViT-B/32) 22.6 41.0 61.0 20.8 39.0 54.2
CLIP4Clip-tightTransf (ViT-B/32) 18.9 37.8 61.6 17.4 36.7 65.3
CLIP4Clip-MeanP (ViT-B/16) 23.5 43.2 54.8 22.6 50.5 50.3
CLIP4Clip-seqLSTM (ViT-B/16) 21.9 39.5 60.7 19.3 39.3 57.6
CLIP4Clip-seqTransf (ViT-B/16) 23.5 45.2 51.6 23.2 42.4 47.4
CLIP4Clip-tightTransf (ViT-B/16) 19.4 39.1 62.2 16.1 37.7 58.3
X-CLIP (ViT-B/32) 23.3 43.0 56.0 22.5 42.2 50.7
X-CLIP (ViT-B/16) 26.1 48.4 46.7 26.9 46.2 41.9
Table 11. Retrieval performance comparison on LSMDC.
Model Text-to-Video (R@1↑ R@5↑ MnR↓) Video-to-Text (R@1↑ R@5↑ MnR↓)
CLIP4Clip-MeanP (ViT-B/32) 43.4 70.2 17.5 42.5 70.6 11.6
CLIP4Clip-seqLSTM (ViT-B/32) 43.4 69.9 17.5 42.4 69.2 11.8
CLIP4Clip-seqTransf (ViT-B/32) 42.8 68.5 18.9 41.4 68.2 12.4
CLIP4Clip-tightTransf (ViT-B/32) 25.8 52.8 27.3 21.5 51.1 22.4
CLIP4Clip-MeanP (ViT-B/16) 44.8 75.1 13.0 47.2 74.0 10.5
CLIP4Clip-seqLSTM (ViT-B/16) 44.7 72.2 15.5 43.9 72.5 11.8
CLIP4Clip-seqTransf (ViT-B/16) 44.8 73.4 13.5 44.7 74.0 10.6
CLIP4Clip-tightTransf (ViT-B/16) 34.8 65.8 20.5 36.5 65.5 13.9
X-CLIP (ViT-B/32) 45.2 74.0 14.6 43.1 72.2 10.9
X-CLIP (ViT-B/16) 47.8 79.3 12.6 47.8 76.8 10.5
Table 12. Retrieval performance comparison on DiDeMo.
Model Text-to-Video (R@1↑ R@5↑ MnR↓) Video-to-Text (R@1↑ R@5↑ MnR↓)
CLIP4Clip-MeanP (ViT-B/32) 40.5 72.4 7.4 42.5 74.1 6.6
CLIP4Clip-seqLSTM (ViT-B/32) 40.1 72.2 7.3 42.6 73.4 6.7
CLIP4Clip-seqTransf (ViT-B/32) 40.5 72.4 7.5 41.4 73.7 6.7
CLIP4Clip-tightTransf (ViT-B/32) 19.5 47.6 17.3 18.9 49.6 16.3
CLIP4Clip-MeanP (ViT-B/16) 44.0 73.9 7.0 44.1 74.0 6.5
CLIP4Clip-seqLSTM (ViT-B/16) 44.4 74.9 6.4 44.7 75.1 6.3
CLIP4Clip-seqTransf (ViT-B/16) 44.5 75.2 6.4 44.1 75.2 6.4
CLIP4Clip-tightTransf (ViT-B/16) 30.8 64.3 9.8 29.6 62.3 9.9
X-CLIP (ViT-B/32) 44.3 74.1 7.9 43.9 73.9 7.6
X-CLIP (ViT-B/16) 46.2 75.5 6.8 46.4 75.9 6.4
Table 13. Retrieval performance comparison on ActivityNet.
Model Text-to-Video Retrieval (R@1↑ R@5↑ MnR↓) Video-to-Text Retrieval (R@1↑ R@5↑ MnR↓)
Transf (ViT-B/32) 38.9 68.8 13.9 39.4 69.4 11.6
X-CLIP (ViT-B/32) 46.1 73.0 13.2 46.8 73.3 9.1
Transf (ViT-B/16) 40.1 70.5 14.0 41.4 71.5 10.8
X-CLIP (ViT-B/16) 49.3 75.8 12.2 48.9 76.8 8.1
Table 14. Retrieval performance comparison between Transformer modeling and the AOSM module. Transf means using a 3-layer Transformer to model multi-grained features.