Video-text retrieval, which aims to return the most relevant videos for a given textual query, is a fundamental research task in multi-modal video-and-language understanding. It has become an emerging requirement with the rapid growth of web videos. In recent years, remarkable progress [yu2018joint, zhu2020actbert, lei2021less, liu2019use, gabeur2020multi, dzabraev2021mdmmt, mithun2018learning, zhang2018cross, liu2021hit, dong2019dual, TimeSformer, arnab2021vivit] has been made across many video-text benchmarks [caba2015activitynet, xu2016msr, wang2019vatex, chen2011collecting, Rohrbach2015lsmdc, Hendricks2017DiDeMo].
Most such approaches focus on two critical issues. The first is visual feature representation in the video domain. Unlike image representation, video feature representation must consider both spatial and temporal dimensions. Multi-path 2D or 3D convolutional networks [zhang2018s3d, feichtenhofer2019slowfast, wang2019hallucinating] are still the core operators for feature learning, where the spatial and temporal representations are handled in the same convolution operation for the semantic and motion modalities. The second is multi-modal interaction between video and language. Based on a large-scale video-text dataset, single-stream or two-stream methods [zhu2020actbert, lei2021less, bain2021frozen, dzabraev2021mdmmt, zhang2018cross, dong2019dual] are adopted to jointly train video features and text features inside the same embedding space. Nevertheless, these two problems are complex enough that it is difficult to achieve both goals in the same network. Some massive video-text pre-training datasets, such as Howto100M [miech2019howto100m], have been collected to address this problem. However, the pretrained models show limited performance gains for video-text retrieval, and annotated video data is hard to collect.
To address these challenges, we rethink the video-text retrieval task from a more macroscopic point of view. While videos and sentences are both sequential, the meaning of a word can be reflected in a single image or in a sequence of frames. For example, atomic actions need to be contextualized with short-term segments, while an object can be described by a single image. Thus, video-and-language understanding is divided into two independent problems: spatial representation learned by multi-modal image-text training, and temporal relationships of video frames and of video-language. Compared with a video-text pre-training model, an image-text model is much easier to learn. The prominent success of CLIP [radford2021learning] (Contrastive Language-Image Pre-training) has demonstrated its capability of learning SOTA image representations from linguistic supervision by pre-training on large-scale image and text pairs.
Based on the spatial semantics captured by CLIP [radford2021learning], we present the CLIP2Video network to transfer the image-language pre-training model to video-text retrieval with two parts: a Temporal Difference Block (TDB) and a Temporal Alignment Block (TAB). As their names imply, the two components are devised to handle the temporal relations of video frames and of video-language, respectively. We transform the video feature into a feature sequence. In the Temporal Difference Block, we add the difference of image frames to the sequence to simulate motion change. In the Temporal Alignment Block, the video sequence and the text sequence are aligned to the same space to enhance the correlation between video clips and phrases. Fig.1 shows the structure of these two components. Similar to our work, the concurrent works [portillo2021straightforward, luo2021clip4clip] are also built on CLIP for video-text retrieval. However, both works only conduct a series of experiments to verify the effect of the CLIP model for pre-training. Instead, we further study how to better model the temporal dependency between video frames and between video and text, taking advantage of an existing strong image pretrained model.
In summary, there are three main contributions in our paper:
We put forward a new perspective on video-language learning with two independent modules, image-text multi-modal learning and temporal relationships between video frames and video-text, which respectively solve the multi-modal learning problems in the spatial and temporal aspects.
We introduce two modules, the Temporal Difference Block and the Temporal Alignment Block, which handle the temporal relationships of video frames and of video-text respectively, and which can be used for any video-language problem.
We report new records of retrieval accuracy on several text-to-video and video-to-text retrieval benchmarks, including MSR-VTT [xu2016msr], MSVD [chen2011collecting] and VATEX [wang2019vatex]. Thorough ablation studies attribute the large improvements to our divided design.
2 Related Work
Video Representation Learning. Previous works mainly focus on 2D/3D spatial-temporal convolution for video representation. SlowFast [feichtenhofer2019slowfast] explores a network architecture that operates in two pathways at different frame rates. Recently, ViT [dosovitskiy2020image], a transformer-based image encoder that delivers impressive performance on image categorization, has attracted much attention. Introducing it into the video domain, ViViT [arnab2021vivit, sharir2021image] and TimeSformer [TimeSformer] propose several variants of ViT, including ones that are more efficient by factorising the spatial and temporal dimensions of the input video. Similar to our work, they apply attention separately along the temporal and spatial dimensions with two-path transformer models. However, their methods still focus on designing an end-to-end network structure to decouple the two problems. We mainly investigate effective temporal relationships built on multi-modal image-text learning.
Video-language Learning. Learning visual representation from text representation is an emerging research topic, benefiting from the collection of large-scale visual and language pairs. Howto100M [miech2019howto100m] is one of the largest datasets for video-text multi-modal pretraining. However, there is much ambiguity and noise between the text semantics and the video content. MIL-NCE [miech2020end] investigates how to leverage these noisy instructional videos to learn a better video encoder in a multi-modal learning manner. Others [lei2021less, Deepti2019large, Jonathan2020Web, Tianhao2020cpd] collect videos and accompanying text from YouTube and Instagram to learn spatio-temporal features in an efficient weakly-supervised manner. However, compared with image-text pretraining datasets, video-text datasets are much more complex to collect and also much noisier. This makes it difficult for video pretraining models to deliver large gains.
Video-Text Retrieval. Early works on video-text retrieval designed intensive fusion mechanisms for cross-modal learning. Based on a large-scale annotated video-text dataset, single-stream or two-stream methods [zhu2020actbert, lei2021less, bain2021frozen, dzabraev2021mdmmt, zhang2018cross, dong2019dual] are adopted to jointly extract video features and language features and project them into the same embedding space. Recently, pre-trained models have dominated the leaderboards of video-text retrieval, with notable results on zero-shot retrieval. Concurrent to our work, [portillo2021straightforward] applies CLIP for zero-shot prediction. We propose to directly transfer the powerful knowledge from the pre-trained CLIP and continue training the designed video-based CLIP2Video model on a video-language dataset. Empirical studies demonstrate the effectiveness of the CLIP2Video model.
Given a set of captions as queries, our goal is to search for the corresponding videos by mapping video and text into a joint embedding space. Inspired by the success of transferring image-text pre-training knowledge into video-text learning [lei2021less], we directly adopt CLIP [radford2021learning] for initialization to extend its ability to text-to-video retrieval. Different from image-to-text retrieval, the temporal correlations of visual clues reflect the semantics of a video, which helps to facilitate cross-modal understanding. Therefore, a temporal difference block is proposed to excite motion-sensitive interactions explicitly. Meanwhile, we propose a temporal alignment block to fully exploit the alignment between the context of the text and the content of key frames.
3.1 Temporal Difference Block
To obtain the video embedding, a vision transformer (ViT) [dosovitskiy2020image] is first adopted to encode every frame into a feature. In particular, ViT extracts non-overlapping image patches and performs a linear projection to map every patch into a token. With the injection of positional embedding and an extra [CLS] token, the sequence of tokens is input into a multi-layer transformer to model the correlation among patches, where each layer comprises Multi-Head Self-Attention (MSA) [vaswani2017attention], layer normalization (LN) [ba2016layer], and a Multi-Layer Perceptron (MLP). Then, a linear projection is adopted to encode the output into an embedding of the same dimension as the text embedding, which serves as the frame representation.
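As a minimal sketch of the patch tokenization described above, the following pure-Python snippet splits an image into non-overlapping patches and counts the resulting transformer tokens. The helper names `patchify` and `num_tokens`, the toy 4×4 image, and the 224×224 input resolution used in the usage note are illustrative assumptions that are not stated in the text; the linear projection and positional embedding are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row; a linear projection would follow.
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def num_tokens(height, width, patch):
    """Number of transformer input tokens: one per patch plus the [CLS] token."""
    return (height // patch) * (width // patch) + 1


# Toy 4x4 "image" whose pixel value encodes its position (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
```

Under the assumed 224×224 input and the 32×32 patches of ViT-B/32, `num_tokens(224, 224, 32)` gives 7 × 7 patch tokens plus one [CLS] token per frame.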
However, as shown in Fig.1, the spatial ViT models the correlation within each frame without considering temporal indices. To exploit the interactions between different frames, we therefore propose a multi-layer temporal transformer to encode the video representation. The frame embeddings output by ViT are concatenated as frame tokens. Since two successive frames contain content displacement, which reflects the actual actions, we explicitly propose a temporal difference block to extend the input and guide the temporal transformer to encode more motion-related representations. The structure is shown in Fig.2. Specifically, we adopt the transformed difference of frame embeddings between adjacent time stamps to describe the motion change: the difference-enhanced token is computed from the two adjacent frame embeddings and the positional embedding P, using a 1-layer transformer followed by the sigmoid function.
Instead of adopting the subtraction directly to represent the difference, we propose to perform difference-level attention with a sigmoid transformation. By applying the attention transformation to the whole subtraction, the subtraction of successive frame embeddings can be encoded to model the long-term relationship of all segments and normalized into [-1, 1] to indicate the difference. We then insert the difference-enhanced tokens between every pair of adjacent frame tokens.
The resulting sequence, with positional (P) and type (T) information added, is the final temporal token sequence output by the temporal difference block. The frame tokens interleaved with difference-enhanced tokens are then input into the temporal transformer, further promoting its sensitivity to motion-related information. Since the difference-enhanced tokens only describe the motion between frames, we only adopt the outputs at the frame tokens as the video embedding, which contains both spatial and temporal information. Global average pooling is then adopted to produce the final video representation.
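The token flow of the temporal difference block can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation: the mapping `2 * sigmoid(x) - 1` is one simple way to land in the stated [-1, 1] range, the `encode` argument is a hypothetical stand-in for the 1-layer transformer (identity by default), and the positional and type embeddings as well as the temporal transformer itself are omitted. All function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def difference_tokens(frames, encode=lambda seq: seq):
    """Map T frame embeddings to T-1 difference-enhanced tokens in (-1, 1).

    `encode` stands in for the 1-layer transformer (assumption: identity here).
    """
    diffs = [[cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]
    encoded = encode(diffs)
    # Assumed normalization: 2 * sigmoid(x) - 1 squashes values into (-1, 1).
    return [[2.0 * sigmoid(v) - 1.0 for v in token] for token in encoded]


def interleave(frames, diffs):
    """Insert each difference token between its two adjacent frame tokens."""
    tokens, is_frame = [], []
    for t, frame in enumerate(frames):
        tokens.append(frame)
        is_frame.append(True)
        if t < len(diffs):
            tokens.append(diffs[t])
            is_frame.append(False)
    return tokens, is_frame


def pool_frame_tokens(outputs, is_frame):
    """Global average pooling over the outputs at frame positions only."""
    kept = [tok for tok, keep in zip(outputs, is_frame) if keep]
    return [sum(tok[d] for tok in kept) / len(kept) for d in range(len(kept[0]))]


frames = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]  # three toy 2-D frame embeddings
diffs = difference_tokens(frames)
tokens, is_frame = interleave(frames, diffs)
# The temporal transformer is replaced by the identity in this sketch.
video = pool_frame_tokens(tokens, is_frame)
```

With T frames the interleaved sequence has 2T - 1 tokens, and only the T frame positions contribute to the pooled video representation, which mirrors the choice described above of discarding the difference-enhanced outputs.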
3.2 Temporal Alignment Block
In common text-video retrieval, the representation of each modality is first computed in its own domain, and the similarity is then measured in a joint space. By adopting the temporal difference block with the temporal transformer, the video embedding can be finely encoded. For the text representation, we directly adopt CLIP's text encoder, which is based on a 12-layer transformer [vaswani2017attention] modified by Radford et al. [radford2019language]. Following CLIP [radford2021learning], lower-cased byte pair encoding (BPE) with a vocabulary size of 49152 [sennrich2015neural] is employed to tokenize the input caption. The tokenized captions are bracketed with [CLS] and [SEP] tokens to indicate the start and end. The text embedding is then computed by the text encoder as a sequence with one representation per token. The output of the [CLS] token is utilized as the overall text representation, and its distance to the video representation is minimized for global matching. However, since the word tokens contain abundant contextual information that fully indicates the entire semantics, all word tokens can be adopted as auxiliary supervision to align key frames with apparent motion change.
Inspired by NetVLAD [arandjelovic2016netvlad], we propose a temporal alignment block to aggregate the token embeddings of different modalities with shared centers and to re-emphasise context-related representations. Specifically, a set of shared centers is learned to align frame and word embeddings jointly. Following [arandjelovic2016netvlad], we calculate the confidence between modal features and shared centers using the dot product, which is employed to assign a weight to each cluster to measure the distribution. For each modal feature, the dot products with all shared centers are normalized to give the normalized similarity between that feature and each center.
The aggregated embedding aligned with each center is then obtained by summing over the modal features, up to the maximum feature length, weighted by their normalized similarity to that center. As in NetVLAD, this aggregation uses an additional trainable weight of the same size as the shared center to increase adaptation [arandjelovic2016netvlad]. The center representation is formed from the aligned center embeddings. Since the video and text are aggregated with shared centers of the same content, the overall semantic context in the tokens of each modality can be fully aligned into the joint space before calculating the similarity.
To further emphasize the weight of motion-related frame tokens toward action-describing centers, we sparsely re-sample the frame embeddings at double frame rate. Although sampling at a large frame rate loses semantic coherence, it highlights changes in motion, which is beneficial as complementary information to re-adjust the weight distribution of motion-related centers. To excite the temporal relationship, the shared temporal difference block is adopted to encode the temporal tokens. We adopt a 1-layer transformer to correlate the temporal tokens and take its output as the difference-enhanced embedding, which is then concatenated with the frame embedding. Finally, the aligned video and text representations are obtained by aggregating the concatenated frame embeddings and the word embeddings with the shared centers, followed by normalization. By adding the difference-enhanced frame tokens at a large frame rate, the weight of action-describing centers can be readjusted for better alignment. Global average pooling is then also adopted to obtain the final aligned representations for video and text.
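The shared-center aggregation can be sketched as below. The sketch assumes the standard NetVLAD form, in which the dot products with the centers are softmax-normalized and each center accumulates weighted residuals to a separate trainable anchor; whether CLIP2Video uses exactly this residual form is an assumption. The names `aggregate`, `centers` and `anchors` and all toy values are hypothetical, and the final normalization and pooling steps are left out.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(features, centers, anchors):
    """Aggregate N x D token features onto K shared centers (NetVLAD-style).

    centers: K x D vectors used for the dot-product confidence.
    anchors: K x D trainable weights of the same size as the centers.
    Returns K x D aligned center embeddings.
    """
    dim = len(centers[0])
    aligned = [[0.0] * dim for _ in centers]
    for feat in features:
        weights = softmax([dot(feat, center) for center in centers])
        for j, weight in enumerate(weights):
            for d in range(dim):
                aligned[j][d] += weight * (feat[d] - anchors[j][d])
    return aligned


centers = [[1.0, 0.0], [0.0, 1.0]]  # K = 2 toy shared centers
anchors = [[0.0, 0.0], [0.0, 0.0]]  # zero anchors keep the example easy to check
frame_tokens = [[2.0, 0.0], [0.0, 2.0]]
word_tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Both modalities are aggregated with the same centers, so slot j of the
# video output and slot j of the text output refer to the same cluster.
video_centers = aggregate(frame_tokens, centers, anchors)
text_centers = aggregate(word_tokens, centers, anchors)
```

Because the centers are shared, the video and text outputs have the same K × D shape regardless of how many frames or words were aggregated, which is what allows them to be compared center by center.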
3.3 Loss function
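To train CLIP2Video, we adopt a symmetric cross-entropy loss. Each training batch consists of video-text pairs, which are discriminated from one another during training. The loss is computed on the cosine similarity between video and text representations and is symmetric over the video-to-text and text-to-video directions.
We compute this loss once with the global representations and once with the aligned representations.
The overall loss function combines the two resulting losses, and the similarity used during inference likewise combines the two corresponding similarities; the weighting is discussed in Sec. 4.2.2.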
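A symmetric cross-entropy loss over a batch of matched video-text pairs can be sketched as follows. The sketch assumes the common contrastive form in which the i-th video and the i-th text are the positive pair and all other in-batch combinations are negatives; summing (rather than averaging) the two directions and the `scale` factor are assumptions, and the function names are illustrative.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def log_softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def symmetric_cross_entropy(videos, texts, scale=1.0):
    """Video-to-text plus text-to-video cross entropy over B matched pairs."""
    size = len(videos)
    sim = [[scale * cosine(v, t) for t in texts] for v in videos]
    # Video-to-text: each row should peak at its own caption (the diagonal).
    v2t = -sum(log_softmax(sim[i])[i] for i in range(size)) / size
    # Text-to-video: the same computation on the columns of the matrix.
    columns = [[sim[i][j] for i in range(size)] for j in range(size)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(size)) / size
    return v2t + t2v


videos = [[1.0, 0.0], [0.0, 1.0]]
matched = symmetric_cross_entropy(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_cross_entropy(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired texts give a lower loss than the swapped ones, which is the behaviour the training objective relies on. The same function would be applied to the global and to the aligned representations to obtain the two loss terms.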
4.1 Experimental Settings
Datasets. We conduct experiments on three benchmarks for the video-to-text and text-to-video retrieval tasks: MSR-VTT [xu2016msr], MSVD [chen2011collecting] and VATEX [wang2019vatex].
MSR-VTT [xu2016msr] contains 10,000 videos, each with 20 captions. We report results on the 1k-A [gabeur2020multi, miech2019howto100m, liu2019use] and full [dzabraev2021mdmmt, luo2020univilm] protocols. The full protocol [dzabraev2021mdmmt, luo2020univilm] is the standard split, which includes 6513 videos for training, 497 videos for validation and 2990 videos for testing. In this protocol, each video has multiple independent captions, all of which are used in text-video retrieval. The 1k-A protocol adopts 9,000 videos with all corresponding captions for training and uses 1,000 video-text pairs for testing. When reporting video-to-text retrieval results, we adopt the maximum similarity among all captions corresponding to a given video query.
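One way to read the maximum-similarity rule for video-to-text retrieval is sketched below: the captions are grouped by the video they annotate, and each group is scored by its best-matching caption. This grouping interpretation, the function name `group_max`, and the toy scores are assumptions made for illustration.

```python
def group_max(similarity, owner, num_groups):
    """Collapse caption-level scores to one score per caption group.

    similarity[q][c]: score between query video q and caption c.
    owner[c]: index of the video that caption c annotates.
    Returns reduced[q][g], the maximum score over the captions of video g.
    """
    reduced = [[float("-inf")] * num_groups for _ in similarity]
    for q, row in enumerate(similarity):
        for c, score in enumerate(row):
            group = owner[c]
            if score > reduced[q][group]:
                reduced[q][group] = score
    return reduced


# Two videos and three captions; captions 0 and 1 belong to video 0.
owner = [0, 0, 1]
similarity = [[0.2, 0.9, 0.4],
              [0.1, 0.3, 0.8]]
reduced = group_max(similarity, owner, num_groups=2)
```

After the reduction the score matrix is square (videos by caption groups), so the same rank-based metrics can be computed for both retrieval directions.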
The MSVD [chen2011collecting] dataset includes 1,970 videos with approximately 80,000 captions, split into 1200, 100 and 670 videos for training, validation and testing. In this paper, we report results on the test split with multiple captions per video.
VATEX [wang2019vatex] includes 34,991 videos with multilingual annotations. The training split contains 25,991 videos. Since the test annotations are not accessible, we report results following the validation protocol of HGR [chen2020fine], which includes 1500 videos for validation and 1500 videos for testing. For fair comparison, we adopt the English annotations.
Evaluation Metric. We follow the standard retrieval protocol [miech2018learning, mithun2018learning, zhang2018cross] and report Recall at rank K (R@K), median rank (MdR) and mean rank (MnR) as metrics, where a higher R@K and lower median and mean ranks indicate better performance.
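The three metrics can be computed from a query-by-candidate similarity matrix as in the following sketch. It assumes one ground-truth candidate per query, placed on the diagonal, and resolves ties optimistically (only strictly higher scores push the rank down); both choices and the function names are assumptions for illustration.

```python
import statistics


def ranks_from_similarity(sim):
    """1-based rank of the ground-truth candidate (diagonal entry) per query."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """R@K in percent, plus median rank (MdR) and mean rank (MnR)."""
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = statistics.mean(ranks)
    return metrics


# Toy 3 x 3 similarity matrix: query i should retrieve candidate i.
sim = [[0.9, 0.1, 0.2],
       [0.8, 0.5, 0.3],
       [0.1, 0.7, 0.6]]
ranks = ranks_from_similarity(sim)
metrics = retrieval_metrics(ranks)
```

Here only the first query ranks its ground truth first, so R@1 is one third of the queries while R@5 and R@10 are saturated; MdR and MnR summarize the same ranks from the other direction, where lower is better.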
Implementation Details. We initialize the text transformer and the spatial transformer (ViT) with CLIP (ViT-B/32). To initialize the other proposed transformers, such as the temporal transformer, we reuse CLIP parameters of similar dimensions. The dimension of all video and text representations is 512. The model is fine-tuned with the Adam optimizer. The caption token length is 32 and the frame length is 12 in our settings. The spatial and temporal transformers of the video encoder have 12 and 4 layers, respectively. The shared centers and their associated weights are trainable. We adopt 5 centers for training on MSVD and MSR-VTT and 7 centers on VATEX for the temporal alignment block. More thorough implementation details can be found in the supplemental material.
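| Method | Text→Video R@1 | Text→Video R@5 | Text→Video R@10 | Text→Video MdR | Text→Video MnR | Video→Text R@1 | Video→Text R@5 | Video→Text R@10 | Video→Text MdR | Video→Text MnR |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean Pooling [luo2021clip4clip] | 43.1 | 70.4 | 80.8 | 2.0 | 16.2 | 43.1 | 70.5 | 81.2 | 2.0 | 12.4 |
| Temporal Transformer [luo2021clip4clip] | 44.5 | 71.4 | 81.6 | 2.0 | 15.3 | 42.7 | 70.9 | 80.6 | 2.0 | 11.6 |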
4.2 Ablation Studies
4.2.1 Effects of Temporal Difference Block.
Compared with mean pooling, the temporal transformer achieves better performance by interacting and aggregating frames. To further enhance the temporal correlation, we adopt the difference of adjacent frame embeddings as a description of the action and insert it into the frame tokens. Specifically, difference-level attention (a 1-layer transformer) is adopted to encode the subtraction as the difference. For fair comparison, we report the results of different settings in Tab.1 to find the best type of video representation. It can be seen that when the subtraction is inserted directly (TDB-Sub) or encoded with just an MLP (TDB-MLP), the whole frame representation with positional indices is damaged. The implicit difference with such a weak transformation increases the difficulty for the temporal transformer to capture motion-related information. In contrast, difference-level attention (TDB) provides explicit interaction before insertion, which significantly improves the performance. We also examine whether to apply global average pooling to the outputs of all tokens. Since the difference-enhanced tokens lack complete spatial information, pooling over all output tokens (TDB-All), including the difference-enhanced tokens, evidently degrades the retrieval performance.
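| Method | Split | Text→Video R@1 | Text→Video R@5 | Text→Video R@10 | Text→Video MdR | Text→Video MnR | Video→Text R@1 | Video→Text R@5 | Video→Text R@10 | Video→Text MdR | Video→Text MnR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dual Enc. [dong2019dual] | Full | 7.7 | 22.0 | 31.8 | 32.0 | - | 13.0 | 30.8 | 43.3 | 15.0 | - |

| Method | Text→Video R@1 | Text→Video R@5 | Text→Video R@10 | Text→Video MdR | Text→Video MnR | Video→Text R@1 | Video→Text R@5 | Video→Text R@10 | Video→Text MdR | Video→Text MnR |
|---|---|---|---|---|---|---|---|---|---|---|
| Dual Enc. [dong2019dual] | 31.1 | 67.5 | 78.9 | 3.0 | - | - | - | - | - | - |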
4.2.2 Effects of Temporal Alignment Block.
We concatenate the frame embedding with the difference-enhanced embedding at double frame rate and aggregate it together with the text embedding by aligning both to the shared centers. In this section, we compare different types of difference-enhanced embedding for alignment and also report the results in Tab.1. By introducing the basic alignment (TDB+TAB-base), the performance of video-to-text retrieval improves evidently. However, since each text token embedding contains abundant contextual information, it is hard to aggregate it with independently modeled frame representations due to the semantic gap. One way to solve this is to use the shared temporal output (TDB+TAB-Temporal) as the difference-enhanced embedding, but then the weight of motion-related frames cannot be strengthened, since every token already contains the whole interaction. Instead, we adopt the frame embedding for basic alignment and add an extra frame embedding encoded with a 1-layer transformer (TDB+TAB-Transformer). The representations at a large frame rate encode the apparent motion change in fewer frames, re-distributing the weight of motion-related centers to align with the context of the text. Finally, by adopting TDB (TDB+TAB-TDB) to further encode the temporal relationship at the large frame rate, we achieve the best performance.
We also provide further ablation studies on the hyper-parameter settings. In Tab.2, we compare the results of different numbers of centers on MSR-VTT for alignment, based on TDB+TAB-base. The results in Tab.2 show that performance degrades as the number of centers increases substantially. Since the number of videos in MSR-VTT is limited, it is hard to converge when aligning with a large number of centers. We therefore choose 5 centers as the best setting for MSVD and MSR-VTT. Because VATEX contains more videos, we adopt 7 centers there to discriminate the key frames more finely. When calculating the loss during training, we adopt a weighted combination of the global and aligned losses, where the weight is 0.5 in our paper. During inference, the similarity between video and text is likewise formulated as a weighted combination of the global and aligned similarities. To study the weight of the two similarities, we report results for different weights in Tab.2. As the weight increases, the confidence of the aligned similarity is weakened, which damages the overall representation performance. When the alignment is not sufficiently trained because of a low weight, the retrieval performance of adopting TDB alone is better than that of adopting TDB and TAB together. However, with a large alignment weight, the loss also cannot converge well, due to the limited initialization. Since both terms are equally important, we adopt a weight of 0.5 in our paper to achieve the best performance.
4.3 Comparison with Other Methods
We compare our proposed method against the state of the art. All results of video-to-text and text-to-video retrieval are reported in Tab.3, Tab.4 and Tab.5. We achieve SOTA results on all three datasets compared with the baselines, and a visible performance gain can be observed when CLIP [portillo2021straightforward] is employed as the pre-trained model. Benefiting from large-scale image-text pre-training, zero-shot retrieval with CLIP and mean pooling already surpasses most fine-tuned methods. Although CLIP alone does not fully exploit the temporal relationships in video, its powerful spatial representations alleviate the lack of temporal information for short videos. Fine-tuning CLIP and modeling the interaction between frames with a temporal transformer further increases performance, as shown by CLIP4Clip [luo2021clip4clip] in Tab.3. However, given a small dataset such as MSVD, it is hard to learn temporal representations with a temporal transformer that relies only on positional embeddings to indicate temporal indices. Instead, we insert difference-enhanced tokens between frame tokens to guide the temporal transformer to capture dynamic motion patterns. In addition, all contextual words are aligned with semantically rich key frames through the shared centers. The optimization is thus supervised to focus on fusing temporal representations, achieving better performance even on small datasets.
4.4 Qualitative Results
We show two kinds of videos retrieved by our proposed method. As depicted in Fig.4, the two queries on the left demonstrate easy cases with a large similarity margin. Since the scenes of these videos differ evidently, CLIP2Video can retrieve them by introducing more temporal information to describe the action precisely. We also show some hard samples on the right of Fig.4, which are hard to discriminate because their frames are similar. For example, given the query "an animated horse is in a barn and the maker asks for comments", the retrieved videos with high confidence both contain an animated horse in a barn. However, the second half of the caption emphasizes that "the maker asks for comments", which is aligned to specific centers to aggregate the target frames. The weight of frames that include subtitles can thus be enhanced by the temporal alignment block, helping to retrieve the video that best matches the description.
We redefine video-text retrieval from a macroscopic point of view, dividing it into image-text multi-modal learning and temporal relationships between video frames and between video and text. To address both sides, we propose the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval. It is based on an image-language pretrained model and two temporal blocks, which capture motion across fine-grained video frames and re-align the tokens between video and language, respectively. Our experimental results show that the proposed approach significantly improves performance on several text-video retrieval benchmarks, setting new records on MSR-VTT, MSVD and VATEX.