In the last few years, vision-language pre-training has achieved great success in cross-modal representation learning from large-scale web-crawled data [radford2021learning, jia2021scaling, Li2021AlignBF, wang2021simvlm, zellers2021merlot, zellers2022merlot, bain2021frozen]. Among them, image-text pre-trained models [radford2021learning, jia2021scaling] have shown powerful representations on various downstream tasks, including vision understanding tasks [gu2021open, wang2021actionclip, rao2022denseclip], image-language generation tasks [patashnik2021styleclip, mokady2021clipcap], and others [guzhov2022audioclip, zhang2022pointclip]. In light of the well-learned and enriched visual representations, some works directly adapt image-text pre-trained models to video-text downstream tasks without further pre-training on video data [luo2021clip4clip, fang2021clip2video, gorti2022x, zhao2022centerclip], while still outperforming models pre-trained on video data [xu2021videoclip, bain2021frozen, xue2022advancing].
Utilizing an existing powerful image-text pre-trained model for further video-language pre-training can reduce the required training cost by making good use of the knowledge learned from images. However, adapting image-text pre-trained models to video-language data for post-pretraining has not yet demonstrated a significant advantage and thus remains under-explored. A preliminary study was conducted by CLIP4Clip [luo2021clip4clip], which adopts MeanPooling on top of the CLIP model and post-pretrains on a subset of HowTo100M [miech2019howto100m]. However, the improvement in video-text retrieval performance is very limited in both zero-shot and fine-tuning settings. In this paper, we aim to explore how to adapt an image-text pre-trained model (e.g., CLIP) to video representation learning in order to improve performance on video-language tasks (e.g., text-to-video retrieval) across various datasets.
To unleash the power of video data for adapting image-text pre-training to post-pretraining, we conduct several preliminary experiments to identify the challenges that hinder post-pretraining. First, we explore post-pretraining an image-text pre-trained model (i.e., CLIP) with MeanPooling on video-language datasets of different scales, including WebVid-2.5M [bain2021frozen] and HD-VILA-100M [xue2022advancing]. The results show that the scale of data is critical for video representation learning. Small-scale data makes the model tend to overfit the new data, while the knowledge learned from image-text pairs is suppressed and performance drops. Second, we study the language domain gap between pre-training data and downstream data. By calculating the Normalized Mutual Information (NMI) on clusters of text features, we find that there is a large domain gap between the subtitles used in large-scale video-language pre-training data and the language of downstream tasks.
To mitigate the impact of these factors, we propose an Omnisource Cross-modal learning method equipped with a Video Proxy mechanism for video-language pre-training. More specifically, we introduce auxiliary captions that have a small domain gap with downstream data into existing large-scale video-text data. To avoid data leakage from downstream tasks and to pursue a high-performance caption generation model, we adopt an image captioning model to generate an auxiliary caption for one frame of each video. During post-pretraining, Omnisource Cross-modal learning fuses caption-frame and video-subtitle pairs in the same batch for joint pre-training. We study a series of variants to find the best fusion strategy. We will release these auxiliary captions to facilitate future research.
To preserve generality and extensibility, we aim to transfer an image model to the video domain with minimal modifications. We propose Video Proxy tokens and a proxy-guided video attention mask mechanism, which add only a negligible number of parameters and computations to the model. We introduce a set of tokens named video proxies into the CLIP model. When calculating attention in each block, video proxies can interact with all features, while patch features only interact with video proxies and features within the same frame. Experimental results show that this mechanism can better model videos containing temporal information, while reducing conflicts with the pre-trained CLIP model.
The experimental results show that our approach improves the performance of video-text retrieval by a large margin compared to image-text pre-trained models. We also conduct ablation studies to verify the effectiveness of each component of our approach, including the data used for post-pretraining, the model structure, and the objective functions. Considering that the generation of auxiliary captions in our approach requires a powerful pre-trained model, we also validate directly introducing existing image-text data into post-pretraining. The results are in good agreement with our preliminary analysis.
Our contributions are summarized as follows:
We conduct a preliminary analysis to identify two factors that hinder video post-pretraining on pre-trained image-text models: data scale and language domain gap;
We then propose an Omnisource Cross-modal learning method equipped with a Video Proxy mechanism to learn from both image-text and video-text pairs;
Extensive experiments verify the effectiveness of our method. Our model outperforms the SOTA results by a large margin on a variety of datasets.
2 Related Work
End-to-end models [lei2021less, xue2022advancing, zellers2021merlot, fu2021violet, huang2020pixel, huang2021seeing, xue2021probing, Li2021AlignBF, kim2021vilt] for vision-language pre-training are replacing the traditional approach of using visual features pre-extracted by off-the-shelf models [sun2019videobert, xu2021videoclip, zhu2020actbert, li2020oscar, li2020unicoder, chen2020uniter]. Training end-to-end models on large-scale web-collected data has also gradually demonstrated big advantages [radford2021learning, jia2021scaling, xue2022advancing, zellers2021merlot, zellers2022merlot]. Unlike images, which have alt-texts, large-scale video datasets suitable for pre-training usually use subtitles as text sources [miech2019howto100m, xue2022advancing]. Subtitles are much noisier than alt-texts. According to [miech2019howto100m], typical examples of incoherence include the content producer asking viewers to subscribe to their channel, talking about something unrelated to the video, or describing something before or after it happens. Bain et al. collect a video dataset, WebVid [bain2021frozen], with manually generated captions. Its texts are well aligned with the videos and do not suffer from ASR errors. However, the vast majority of WebVid videos are sourced from a stock-footage website, which limits scaling up. Video-subtitle data is more easily accessible on the web and thus more suitable for scaling up. In this paper, we investigate the unfavorable factors of video-subtitle data and explore how to mitigate their impact.
CLIP for Video-Text Retrieval
The great success of CLIP has demonstrated its unprecedented power on various downstream tasks, including vision understanding tasks [gu2021open, wang2021actionclip, rao2022denseclip], image-language generation tasks [patashnik2021styleclip, mokady2021clipcap], and others [guzhov2022audioclip, zhang2022pointclip]. By contrastive learning on large-scale image-text pairs, CLIP learns enriched visual concepts for images. Recently, some works directly transfer CLIP to video-text retrieval without further pre-training on video data (post-pretraining) [luo2021clip4clip, fang2021clip2video, gorti2022x, zhao2022centerclip, wang2022disentangled]. These works take the performance of video-text retrieval to a new level, outperforming existing models pre-trained on data containing videos [xu2021videoclip, bain2021frozen, xue2022advancing, ge2022bridging, wang2022all]. They transfer CLIP from the perspectives of feature aggregation [luo2021clip4clip, zhao2022centerclip, fang2021clip2video, gorti2022x] or representation alignment [fang2021clip2video, gorti2022x, wang2022disentangled]. In parallel with these works, we study post-pretraining with video data on top of CLIP in an effective way. Their approaches may also be applicable on top of ours.
3 Preliminary Analysis
In this section, we first study the impact of data scale for adapting image-text pre-training to post video-text pre-training. Then we explore the unfavorable factors of large-scale data from the language domain’s perspective.
3.1 Post-pretraining on Different Data Scales
To study the effectiveness of different data scales, we use the CLIP-ViT-B/32 model [radford2021learning] as the base image-text pre-trained model and adopt the MeanPooling setting for video adaptation like CLIP4Clip [luo2021clip4clip]. Two video-language datasets are used: WebVid-2.5M [bain2021frozen] with 2.5 million pairs and HD-VILA-100M [xue2022advancing]
with 100M pairs. We also adopt a subset of HD-VILA-100M containing a random 10% of the data (namely HD-VILA-10M) as a middle setting. We run the same number of steps on all settings, equivalent to 1 epoch on HD-VILA-100M. We uniformly sample 12 frames from each video and apply the same hyper-parameters as described in Section 5 for all settings.
During post-pretraining, we evaluate the pre-trained models by fine-tuning on the MSR-VTT text-to-video retrieval task. Figure 1 shows the performance trend. We observe an overfitting phenomenon: continuous post-pretraining leads to a performance drop, and this phenomenon is more significant on smaller data (e.g., WebVid-2.5M and HD-VILA-10M). As CLIP is pre-trained on 400 million image-text pairs, further training on small data makes the model tend to overfit the new data while the implicit knowledge learned from the image-text pairs degrades. As a consequence, the performance drops, even below that of using CLIP directly. This motivates us to adopt HD-VILA-100M due to its large scale and scalability.
3.2 Language Domain Gap with Downstream Data
It is intuitive that pre-training data from the same domain as downstream data can benefit downstream tasks. For most video-language tasks like video-text retrieval, the language consists of descriptive sentences about videos. For HD-VILA-100M, which we decided to use above, the language consists of auto-generated subtitles, which can have a very different correspondence to visual information compared to descriptive texts. Meanwhile, auto-generated subtitles also suffer from irrelevance, misalignment, and ASR errors. To better explore the domain gap between pre-training data and downstream data, we measure the inconsistency by calculating the dissimilarity between vision-aware language features. For downstream language data, we choose two typical video-text retrieval datasets: MSR-VTT [xu2016msr] and DiDemo [anne2017localizing]. For pre-training language, we select four datasets: video subtitles of HD-VILA-100M (HD-VILA subtitles), video captions of WebVid-2.5M, image captions of MS-COCO [lin2014microsoft], and web-collected alt-texts of Conceptual Captions 12M [changpinyo2021conceptual]. In addition, we analyze auto-generated captions of HD-VILA-100M (HD-VILA captions), which will be described in Section 4.
We use a Transformer Encoder initialized from CLIP [radford2021learning]
to extract vision-aware language features. To quantify the domain gap, we first mix language features from one pre-training dataset and one downstream dataset, and use K-means to obtain two clusters. Then we calculate the Normalized Mutual Information (NMI) between cluster labels and ground truths. A larger NMI value means that the two types of features are easier to distinguish, and thus there is a larger domain gap. For each comparison, we randomly sample 1,000 texts from each type of data as one setting, and the result is the average over 10 settings. We report the results in Table 1. From the last column, we find that the NMI score between HD-VILA subtitles and downstream data is much larger than the others, especially for the MSR-VTT dataset. This indicates that direct training with subtitles may introduce inconsistency with downstream tasks.
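For illustration, the clustering-based measurement can be sketched in NumPy. This is a minimal stand-in, not the exact implementation: the feature clouds below are random substitutes for the CLIP text features, the 2-means and NMI helpers are our own simplified versions, and all names are assumptions of this sketch.

```python
import numpy as np

def kmeans2(x, iters=50):
    """Simple 2-means (Lloyd's algorithm); init with a farthest-point pair."""
    c0 = x[0]
    c1 = x[((x - c0) ** 2).sum(1).argmax()]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = x[labels == k].mean(0)
    return labels

def nmi(a, b):
    """Normalized mutual information between two binary label arrays."""
    eps = 1e-12
    h_a = h_b = mi = 0.0
    for i in range(2):
        pa, pb = (a == i).mean(), (b == i).mean()
        h_a -= pa * np.log(pa + eps)
        h_b -= pb * np.log(pb + eps)
    for i in range(2):
        for j in range(2):
            pij = ((a == i) & (b == j)).mean()
            if pij > 0:
                mi += pij * np.log(pij / ((a == i).mean() * (b == j).mean() + eps))
    return mi / (np.sqrt(h_a * h_b) + eps)

# Toy stand-ins for text features from a pre-training corpus and a downstream
# corpus: well-separated clouds should give an NMI close to 1 (large domain gap).
rng = np.random.default_rng(0)
pretrain_feats = rng.normal(0.0, 1.0, size=(200, 8))
downstream_feats = rng.normal(5.0, 1.0, size=(200, 8))
mixed = np.vstack([pretrain_feats, downstream_feats])
truth = np.array([0] * 200 + [1] * 200)
score = nmi(kmeans2(mixed), truth)
```

If the two corpora instead came from the same domain, the clusters would not align with the corpus labels and the NMI would drop toward 0.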
Table 1: Normalized Mutual Information (NMI) scores of language features extracted from a series of pre-training datasets and downstream tasks. We choose MSR-VTT and DiDemo as downstream tasks. A larger value indicates a larger domain gap.
4 Approach

In this section, we first propose a Video Proxy mechanism to transfer the CLIP model to the video domain. Then we introduce an in-domain auxiliary data generation method and an Omnisource Cross-modal learning method to reduce the domain gap between pre-training data and downstream data.
4.1 Video Proxy Mechanism
As a video is an ordered sequence of frames, it is critical to learn frame aggregation and temporality when transferring to the video domain. Meanwhile, to keep the high generality and extensibility of the ViT backbone, we aim to find a simple but effective way to transfer the CLIP model to the video domain with minimal modifications. Given a video containing $T$ frames $\{f_1, f_2, \dots, f_T\}$, we follow CLIP to divide each frame into $N$ patches $\{p_{t,1}, p_{t,2}, \dots, p_{t,N}\}$. Then we add a spatio-temporal positional embedding to each flattened 2D patch:
$$z_{t,n} = \mathrm{Linear}(p_{t,n}) + \mathbf{e}^{spa}_{n} + \mathbf{e}^{tem}_{t},$$

where $\mathrm{Linear}(\cdot)$ is a linear layer, and $\mathbf{e}^{spa}$ and $\mathbf{e}^{tem}$ are the learnable spatial and temporal positional embeddings, respectively. The whole video can be divided and projected into $T \times N$ patch tokens.
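The embedding step above can be illustrated with a small NumPy sketch. The shapes and names (`T`, `N`, `D_in`, `D`, `W`) are illustrative assumptions, not the exact CLIP-ViP configuration; the point is the broadcast of a per-position spatial embedding and a per-frame temporal embedding over the projected patches.

```python
import numpy as np

T, N, D_in, D = 12, 49, 768, 512   # frames, patches per frame, raw patch dim, model dim
rng = np.random.default_rng(0)

patches = rng.normal(size=(T, N, D_in))          # flattened 2D patches, frame by frame
W = rng.normal(size=(D_in, D)) * 0.02            # the linear patch-projection layer
e_spatial = rng.normal(size=(1, N, D)) * 0.02    # spatial embedding, shared across frames
e_temporal = rng.normal(size=(T, 1, D)) * 0.02   # temporal embedding, shared within a frame

# z_{t,n} = Linear(p_{t,n}) + e_spatial_n + e_temporal_t  (broadcast over t and n)
z = patches @ W + e_spatial + e_temporal
tokens = z.reshape(T * N, D)                     # the T*N patch tokens fed to the ViT
```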
To model spatial information across multiple frames as well as temporality, one simple way is to directly feed all tokens into CLIP's vision encoder and let them attend to each other. However, this method introduces significant conflicts with CLIP: as CLIP is pre-trained on pairs of text and single images, it never handles interactions of tokens between images/frames during training. This is also verified by our experiments in Section 5. Instead, we introduce a Video Proxy mechanism that acts as a proxy to help each local patch perceive video-level information.
Before feeding into CLIP, we concatenate the patch tokens with a set of learnable parameters called video proxies $\{v_1, v_2, \dots, v_M\}$, where $M$ is the number of video proxies. Then all tokens are fed into the ViT of CLIP. The output of the first video proxy is regarded as the video's representation. All calculations in the ViT are the same as in the original version except for the attention mechanism. In the attention score calculation of each block, video proxies attend to all tokens, while patch tokens only attend to the video proxies plus the tokens within the same frame. Through this mechanism, patch tokens can obtain global information from video proxies, while reducing inconsistencies with the original CLIP's calculation. Our experiments in Section 5 demonstrate the superiority of this mechanism.
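The proxy-guided attention mask can be sketched as a boolean matrix. This is a minimal NumPy version under an assumed token order of `[M video proxies, then T*N patch tokens frame by frame]`; the function name and layout are ours, not the released implementation.

```python
import numpy as np

def proxy_attention_mask(M, T, N):
    """allowed[i, j] == True iff token i may attend to token j.
    Video proxies attend to all tokens; a patch token attends only to the
    proxies and to the patch tokens of its own frame."""
    L = M + T * N
    allowed = np.zeros((L, L), dtype=bool)
    allowed[:M, :] = True                 # proxies -> every token
    allowed[:, :M] = True                 # every token -> proxies
    for t in range(T):
        s = M + t * N                     # start of frame t's patch tokens
        allowed[s:s + N, s:s + N] = True  # within-frame attention
    return allowed

mask = proxy_attention_mask(M=4, T=12, N=49)
```

In a real ViT block this mask would be applied by setting disallowed attention logits to a large negative value before the softmax.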
For image/frame input, we use linear interpolation over the temporal positional embeddings to obtain a middle temporal positional embedding, and then treat the image/frame as a special single-frame video. This enables joint training on both videos and images in the same batch, as our attention mechanism reduces the difference in calculation between videos and images.
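The interpolation step can be sketched as follows; a hypothetical NumPy version where the learned temporal embeddings are random stand-ins and the fractional middle position is an assumption of this sketch.

```python
import numpy as np

T, D = 12, 512
rng = np.random.default_rng(0)
e_temporal = rng.normal(size=(T, D))   # learned temporal embeddings for T frame slots

t_mid = (T - 1) / 2                    # fractional "middle" position, 5.5 for T = 12
lo, hi = int(np.floor(t_mid)), int(np.ceil(t_mid))
w = t_mid - lo
e_mid = (1 - w) * e_temporal[lo] + w * e_temporal[hi]
# e_mid is then used as the temporal embedding of the single image/frame,
# which is treated as a one-frame "video" by the rest of the model.
```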
4.2 In-domain Auxiliary Data Generation
Motivated by the analysis in Section 3, we introduce auxiliary captions into the large-scale video-subtitle data to reduce the domain gap between pre-training and downstream data. We adopt an image captioning model for two reasons: 1) the training datasets of current video captioning models overlap with downstream tasks, e.g., MSR-VTT, and we avoid data leakage so that pre-training remains agnostic to downstream data; 2) the performance of existing captioning models on videos lags far behind that on images. For the above reasons, we choose a powerful image captioning model, OFA-Caption [wang2022unifying], to generate one caption for the middle frame of each video in HD-VILA-100M, using the default settings of the OFA-Caption model. As a result, we generate 100M sentences with a max length of 16. This approach can be applied to any video data, and we will release the generated captions on HD-VILA-100M to facilitate future research.
4.3 Omnisource Cross-Modal Learning
To learn rich video-language alignment from video-subtitle pairs while reducing the domain gap with downstream data via the corresponding auxiliary frame-caption pairs, we study joint cross-modal learning on the omnisource input. Following most works that learn multimodal alignment on dual encoders [radford2021learning, xue2022advancing, Li2021AlignBF, luo2020univl, xu2021videoclip, luo2021clip4clip], we use the info-NCE loss to perform contrastive learning. There are two formats of visual source in our work, video sequences and single frames, and two types of text source, subtitles and captions. We denote them by $V$, $F$, $S$, and $C$, respectively. We define a source-wise info-NCE loss by:
$$\mathcal{L}_{X \leftrightarrow Y} = -\frac{1}{B}\sum_{i=1}^{B}\left(\log\frac{\exp(x_i^{\top} y_i / \tau)}{\sum_{j=1}^{B}\exp(x_i^{\top} y_j / \tau)} + \log\frac{\exp(y_i^{\top} x_i / \tau)}{\sum_{j=1}^{B}\exp(y_i^{\top} x_j / \tau)}\right),$$

where $x_i$ and $y_i$ are the normalized embeddings of the $i$-th visual feature in $X$ and the $i$-th text feature in $Y$ in a batch of size $B$, and $\tau$ is a learnable temperature. For example, $\mathcal{L}_{V \leftrightarrow S}$ represents the info-NCE loss within video-subtitle pairs in a batch, which pulls aligned pairs together in the embedding space while pushing apart misaligned pairs.
We study the reasonable variants of Omnisource Cross-Modal Learning: (a) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{F \leftrightarrow C}$: a simple combination of the two source-wise losses on video-subtitle and frame-caption pairs; (b) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{V \leftrightarrow C}$: as there is also content correlation between a video and its middle-frame caption, we explore adding a loss on video-caption pairs to the baseline loss $\mathcal{L}_{V \leftrightarrow S}$; (c) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{V \leftrightarrow C} + \mathcal{L}_{F \leftrightarrow C}$: the combination of (a) and (b); (d) $\mathcal{L}_{V \leftrightarrow [S,C]} + \mathcal{L}_{F \leftrightarrow C}$: a video corresponds to both a subtitle and an auxiliary caption, so compared with (c), the number of negative pairs in $\mathcal{L}_{V \leftrightarrow [S,C]}$ can be expanded. The visual-to-text term in $\mathcal{L}_{V \leftrightarrow [S,C]}$ is rewritten as:
$$\mathcal{L}^{*}_{v \to t} = -\frac{1}{B}\sum_{i=1}^{B}\left(\log\frac{\exp(v_i^{\top} s_i / \tau)}{Z_i} + \log\frac{\exp(v_i^{\top} c_i / \tau)}{Z_i}\right),$$

where $Z_i = \sum_{j=1}^{B}\left(\exp(v_i^{\top} s_j / \tau) + \exp(v_i^{\top} c_j / \tau)\right)$, and $s_j$ and $c_j$ are the normalized embeddings of the $j$-th subtitle and caption in the batch. The text-to-visual term is rewritten symmetrically, and the $\mathcal{L}_{F \leftrightarrow C}$ term in (d) is the same as in (c). We compare all variants with the baseline $\mathcal{L}_{V \leftrightarrow S}$ and report the results in Section 5.
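The source-wise loss and the expanded-negative idea of variant (d) can be sketched in NumPy. This is a simplified illustration, not the training code: `info_nce` is a symmetric per-batch info-NCE, `info_nce_expanded` is only the video-to-text side of (d) with a second text source supplying extra negatives, and all names, the fixed temperature, and the random embeddings are assumptions of this sketch.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(x, y, tau=0.07):
    """Symmetric source-wise info-NCE over a batch of aligned (x_i, y_i) pairs."""
    sim = l2norm(x) @ l2norm(y).T / tau                     # (B, B) similarities
    loss_xy = -np.mean(np.diag(sim) - np.log(np.exp(sim).sum(axis=1)))
    loss_yx = -np.mean(np.diag(sim) - np.log(np.exp(sim).sum(axis=0)))
    return loss_xy + loss_yx

def info_nce_expanded(v, pos, extra_neg, tau=0.07):
    """Video-to-text term with negatives drawn from a second text source too."""
    v, pos, extra_neg = l2norm(v), l2norm(pos), l2norm(extra_neg)
    num = (v * pos).sum(axis=1) / tau                       # positive similarities
    denom = np.exp(v @ pos.T / tau).sum(1) + np.exp(v @ extra_neg.T / tau).sum(1)
    return -np.mean(num - np.log(denom))

B, D = 8, 16
rng = np.random.default_rng(0)
video, subtitle, caption = (rng.normal(size=(B, D)) for _ in range(3))
loss = info_nce(video, subtitle) + info_nce_expanded(video, caption, subtitle)
```

The expanded denominator is what grows the pool of negatives per video relative to variant (c); in practice $\tau$ is learnable and the similarities are gathered across GPUs.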
| Model | MSR-VTT R@1 | R@5 | R@10 | Avg |
| --- | --- | --- | --- | --- |
| 2 Video Proxies | 45.8 | 71.3 | 81.7 | 66.3 |
| 4 Video Proxies | 46.5 | 72.1 | 82.5 | 67.0 |
| 8 Video Proxies | 45.7 | 72.7 | 81.7 | 66.7 |
| Post-pretrain Data | MSR-VTT R@1 | R@5 | R@10 | Avg | DiDemo R@1 | R@5 | R@10 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HD-VILA + ImageCaption | 49.1 | 73.1 | 83.5 | 68.6 | 47.0 | 75.3 | 84.1 | 68.8 |
| Method | R@1 | R@5 | R@10 | MdR |
| --- | --- | --- | --- | --- |
| Support Set [patrick2021supportset] | 30.1 | 58.5 | 69.3 | 3.0 |
| Support Set [patrick2021supportset] | 29.2 | 61.6 | - | 3.0 |
| CLIP4Clip by [zhao2022centerclip] | 24.1 | 45.0 | 55.1 | 8.0 |
5 Experiments

In this section, we first describe the implementation details of video-language post-pretraining and of fine-tuning on downstream tasks. Then thorough ablation studies are conducted to demonstrate the effectiveness of our designed CLIP-ViP model, which learns video-language representations by joint pre-training on videos and images. Finally, we compare our model with existing state-of-the-art methods.
5.1 Experimental Details
For video input, we uniformly sample 12 frames from each video and resize all frames to 224×224. For language, we adopt CLIP's tokenizer to split a sentence into word tokens with a max length of 70. We use the AdamW optimizer [loshchilov2017adamw] with an initial learning rate of 5e-6 and a fixed weight decay of 5e-2. For the learning rate schedule, we adopt cosine decay with a warm-up strategy. We train our model on 32 NVIDIA Tesla V100 GPUs with a batch size of 1024. The contrastive similarity is calculated on features gathered from all GPUs. We set the number of training steps equal to one epoch on HD-VILA-100M for all ablation studies and three epochs for the full setting.
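The warm-up plus cosine-decay schedule can be sketched as a pure-Python function. The peak learning rate mirrors the value above, but the exact warm-up length and decay floor are assumptions of this sketch.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=5e-6):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total, warmup = 10000, 500  # illustrative step counts, not the paper's
schedule = [lr_at(s, total, warmup) for s in range(total + 1)]
```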
We reuse most hyper-parameters from post-pretraining, with some exceptions. 1) Batch size: we fine-tune our model with a batch size of 128 for all downstream tasks. 2) Learning rate and weight decay: set to 1e-6 and 0.2, respectively. 3) Number of epochs: due to the various scales of the downstream datasets, we set the epoch number to 5, 20, 10, and 20 for MSR-VTT, DiDemo, LSMDC, and ActivityNet, respectively. 4) Frame number: we set the frame number to 32 for ActivityNet Captions as its videos are much longer (180 seconds on average). Note that the hyper-parameters of downstream training are the same across all settings in the ablation study.
We conduct video-text retrieval experiments on four typical datasets. (a) MSR-VTT [xu2016msr] contains 10K YouTube videos with 200K descriptions. We follow previous works [yu2018jsfusion, liu2019use] to train models on 9K videos, and report results on the 1K-A test set. (b) DiDeMo [anne2017localizing] consists of 10K Flickr videos annotated with 40K sentences. We follow [liu2019use, zhang2018cross] to evaluate paragraph-to-video retrieval, where all descriptions for a video are concatenated to form a single query. (c) LSMDC [Rohrbach2016MovieD] consists of 118,081 video clips sourced from 202 movies. Each video has a caption. Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. (d) ActivityNet Captions [Krishna2017actnetcaption] contains 20K YouTube videos annotated with 100K sentences. We follow the paragraph-to-video retrieval setting [zhang2018cross, liu2019use] to train models on 10K videos and report results on the val1 set with 4.9K videos.
5.2 Ablation Study
Video proxy mechanism.
For the vision encoder, we evaluate our proposed Video Proxy (ViP) mechanism with different numbers of proxies and compare it with different model structures (i.e., MeanPool, SeqTransformer, Full Attention) by fine-tuning the same pre-trained model on the MSR-VTT retrieval task. MeanPool simply takes the average of frame features as the representation of the whole video. For SeqTransformer, we follow the seqTransf type in CLIP4Clip [luo2021clip4clip] and the residual connection in their implementation. The Full Attention setting takes all patch tokens as the input of the vision encoder, and attention is computed over all tokens. We also study the impact of different numbers of video proxies. All models are initialized from CLIP-ViT-B/32. The results are shown in Table 3. The simplest model is the MeanPool type, which completely disregards temporality. Compared to the MeanPool baseline, SeqTransformer improves the average of Recall@1, 5, 10 by 0.8%. The Full Attention type leads to a significant performance drop, although it allows individual features to interact globally. Besides, we observe that this type has a worse initial status and converges more slowly than the other types. One reason is that Full Attention has a significant calculation conflict with CLIP's in-frame attention. From Table 3, our Video Proxy mechanism yields the largest improvement. All tested numbers of video proxies result in significant performance gains on R@1 (e.g., 3.1% with 4 proxies), while adding only a negligible number of parameters: 3K compared to the 86M of the ViT backbone.
Omnisource Cross-modal Learning.
To verify the effectiveness of our proposed Omnisource Cross-modal Learning and to compare its variants, we set up a post-pretraining and fine-tuning pipeline and adopt the same hyper-parameters for all experiments. $\mathcal{L}_{V \leftrightarrow S}$ is the baseline contrastive loss on video-subtitle pairs. After introducing auxiliary captions, we study four variants of the Omnisource Cross-modal Learning loss: (a) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{F \leftrightarrow C}$; (b) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{V \leftrightarrow C}$; (c) $\mathcal{L}_{V \leftrightarrow S} + \mathcal{L}_{V \leftrightarrow C} + \mathcal{L}_{F \leftrightarrow C}$; (d) $\mathcal{L}_{V \leftrightarrow [S,C]} + \mathcal{L}_{F \leftrightarrow C}$. The specific approach is described in Section 4. We post-pretrain with each loss function for only one epoch due to the costly training, then fine-tune on two video-text retrieval datasets: MSR-VTT and DiDemo.
We compare with the results of CLIP-MeanPool and CLIP with our proposed Video Proxy mechanism, both without post-pretraining. The results are listed in Table 2. From the results on MSR-VTT, we can see that $\mathcal{L}_{V \leftrightarrow S}$ alone brings very little improvement: 0.4% on the average of Recall@1, 5, 10. According to our preliminary analysis, this is due to the large domain gap between MSR-VTT and the post-pretraining data. Combined with auxiliary captions, all four variants of the Omnisource Cross-modal Learning loss bring significant improvements: over 3% on Recall@1 and over 2.3% on the average of Recall@1, 5, 10. On DiDemo, building on the improvement brought by $\mathcal{L}_{V \leftrightarrow S}$, Omnisource Cross-modal Learning further improves the results by a large margin: 8% on the average of Recall@1, 5, 10. Finally, variant (d), $\mathcal{L}_{V \leftrightarrow [S,C]} + \mathcal{L}_{F \leftrightarrow C}$, performs best, so we apply this variant as our final setting.
In this part, we ablate the contributions of the large-scale noisy data and the auxiliary data. For uni-source, we post-pretrain on video-subtitle pairs and on video-caption pairs, respectively, with the vanilla contrastive loss. For data combination, we apply Omnisource Cross-modal Learning under the $\mathcal{L}_{V \leftrightarrow [S,C]} + \mathcal{L}_{F \leftrightarrow C}$ setting to post-pretrain on the combined data. From Table 4, the omnisource post-pretraining results are much better than the two uni-source results. Especially on MSR-VTT, neither uni-source post-pretraining yields a significant improvement (67.4% and 66.9% compared with 67.0%), while omnisource post-pretraining brings a qualitative improvement: 69.6%. On DiDemo, data combination nearly doubles the improvement brought by uni-source. These experimental results verify that it is the data combination itself that is effective, rather than simply better data being used.
As the generation of auxiliary captions is based on OFA, a powerful image-text pre-trained model, we also explore including only existing data in our post-pretraining. We choose image-text pairs from several widely adopted datasets, MS-COCO, Visual Genome (VG) [krishna2017visual], Flickr-30K [young2014image], SBU [ordonez2011im2text], CC3M [sharma2018conceptual], and CC12M, as our auxiliary data (ImageCaption). To ablate the contribution of these data, we add experiments of post-pretraining on ImageCaption alone and on HD-VILA-100M combined with ImageCaption. From Table 4, post-pretraining on ImageCaption alone results in performance degradation on MSR-VTT and no improvement on DiDemo. In contrast, ImageCaption yields significant performance gains on both datasets when used as auxiliary data for HD-VILA-100M. This further illustrates the importance of combining large-scale noisy data with auxiliary data.
5.3 Comparison to State-of-the-art
We compare our model under the full setting (3 epochs) with state-of-the-art (SOTA) works on the text-to-video retrieval task. The results of fine-tuning on four datasets (i.e., MSR-VTT, DiDemo, ActivityNet Captions, and LSMDC) are shown in Tables 5, 6, 7, and 8, respectively. We clarify the backbone for CLIP-based works. Our model achieves SOTA results on all datasets. Note that some existing methods are also applicable on top of our model, as our modification to the CLIP model is minimal; however, how to extend our model is beyond the scope of this paper. We only add results with DSL [cheng2021improving] to make a fair comparison with methods using post-processing operations, e.g., DSL or QB-Norm [bogolin2022cross], such as CAMoE [cheng2021improving].
For the MSR-VTT, DiDemo, and ActivityNet Captions datasets, our model outperforms existing methods by a large margin with both CLIP-ViT-B/32 and CLIP-ViT-B/16. For LSMDC, our model achieves SOTA results with CLIP-ViT-B/32 and outperforms by a large margin with CLIP-ViT-B/16. Note that even without DSL, our results still surpass methods using post-processing operations on most datasets. Adding DSL greatly improves the performance of our model, since our model has good bidirectional vision-language correspondence. The good results on the ActivityNet Captions dataset also indicate that our model generalizes well to long videos. Overall, the improvements on different datasets demonstrate the good video-language representation learned by our model.
6 Conclusion

In this paper, we study further pre-training (post-pretraining) image-text models like CLIP on large-scale video data. We first conduct a preliminary analysis to identify the factors hindering video post-pretraining. Motivated by our findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism. The Video Proxy mechanism effectively models videos containing temporal information, while reducing conflicts with the pre-trained CLIP model. Omnisource Cross-modal Learning alleviates the problem caused by the domain gap between video-subtitle data and downstream data. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin and achieves SOTA results on a variety of datasets.