Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

by Hongwei Xue, et al.

We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embedding, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset including 371.5k hours of 720p videos, and 2) the most diversified dataset covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts. Our pre-training model achieves new state-of-the-art results in 10 VL understanding tasks and 2 more novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 38.5% R@1 in the zero-shot MSR-VTT text-to-video retrieval task, and 53.6% on the high-resolution dataset LSMDC. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual manipulation and super-resolution tasks.




1 Introduction

Recent years have witnessed an increasing number of videos with the popularity of appealing video websites and mobile apps (e.g., YouTube, TikTok). With the rapid development of smartphone cameras, device storage, and 5G networks, high-quality video creation and diverse content sharing, such as travel, sports, and music, have become a new fashion. Therefore, the capability of video analysis and joint high-level understanding with language plays a key role in many video applications, such as video search [bain2021frozen, patrick2021supportset], video recommendation [TaoVideoRec_MM12], and video editing [xia2021tedigan, patashnik2021styleclip]. To facilitate video understanding, we study joint video and language (VL) pre-training, which is a new paradigm in both natural language processing and computer vision [xu2021videoclip, chen2020uniter, huang2021seeing].

Existing video-language understanding models are highly limited in the scale and scope of video-language datasets. Early datasets (e.g., MSR-VTT [xu2016msr], DiDeMo [anne2017localizing], EPIC-KITCHENS [damen2018scaling]) consist of videos and textual descriptions that are manually annotated by humans. The heavy and expensive annotation cost limits the scale of data. Moreover, datasets with only descriptive sentences are limited in complexity and variability, which largely hinders generalization power. Recently, several datasets [bain2021frozen, miech2019howto100m] have been built by pairing videos with transcriptions obtained through ASR (automatic speech recognition), so that the data scale can be greatly enlarged. One of the most representative works is HowTo100M [miech2019howto100m], which consists of million-scale instructional videos. However, there are still large gaps between these video datasets and real-scenario videos in terms of video quality and semantic diversity.

Figure 1: Examples of video clips and ASR generated transcriptions in the proposed HD-VILA-100M dataset. We present six samples (four frames for each), with diverse video categories covering HowTo & Style, People & Blog, Sports, Travel & Event, Pets & Animals, Film & Animation. Relevant words from auto-generated video transcriptions are manually highlighted in red. [Best viewed in Color]

To solve the above limitations, we propose the HD-VILA-100M dataset (i.e., High-resolution and Diversified VIdeo and LAnguage), which consists of a wide range of video categories and will benefit plenty of VL tasks, such as text-to-video retrieval [patrick2021supportset] and video QA [lei2021less]. This dataset has the following key properties: (1) Large: we have collected one of the largest video-language datasets, which consists of 100M video clip and sentence pairs from 3 million videos with 371.5K hours in total (about 2.8× the video hours and 8× the average sentence length of HowTo100M [miech2019howto100m], per Table 1). (2) High resolution: all the videos are 720p, which is much higher quality than existing datasets that are mostly 240p or 360p. (3) Diverse and balanced: we cover a wide range of topics from YouTube, with 15 popular categories (e.g., sports, music, autos). Meanwhile, we ensure a balanced number of video clips in each category to ease the underfitting problem.

To enable video-language pre-training, effective video representation plays a key role. Due to computational limitations (e.g., memory), previous works either 1) adopt simple frame-based encoders, and turn to end-to-end visual encoding and multimodal fusion [lei2021less], or 2) choose advanced spatiotemporal encoders [carreira2017quo, xie2018rethinking], while having to do visual encoding and multimodal fusion step-by-step. Few works can learn joint spatiotemporal video representation in end-to-end video-language pre-training.

In this paper, we propose to utilize a hybrid image sequence that consists of a few high-resolution (HR) frames and more low-resolution (LR) neighbor frames for multiple video learning tasks. Such a design enables end-to-end training with high-resolution spatiotemporal video representation. To achieve this goal, we answer the following questions regarding model design: (1) Which HR and LR frames should be sampled? (2) How can spatiotemporal features be learned from the hybrid image sequences? To tackle these two problems, we first randomly sample HR frames from a video clip to ensure the robustness of the learned video features. LR frames are uniformly sampled from the surroundings of the HR frames, considering that neighboring frames contain similar spatial information and are critical to temporal feature learning. Second, we propose to encode HR and LR frames separately while mapping the HR feature to a joint embedding space with the LR features by a hybrid Transformer. Such a design ensures that the spatiotemporal representation of videos covers both HR and LR frames in a learnable way. The learned spatiotemporal feature is further combined with detailed spatial features, followed by a multimodal Transformer that learns to optimize video and language embedding in an end-to-end manner.

Our contributions are summarized as follows: 1) We use automatic video transcriptions to build the largest high-resolution and diversified video and language dataset to date; 2) We propose a novel pre-training framework to learn spatiotemporal information for video representation from hybrid image sequences that consist of HR and LR frames; 3) Extensive experiments verify the effectiveness of the learned cross-modality embedding in 10 video understanding and 2 more text-to-visual generation tasks.

| Dataset | Domain | #Video clips | #Sentences | Avg clip len (sec) | Avg sent len (words) | Duration (h) | Resolution |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSR-VTT [xu2016msr] | open | 10K | 200K | 15.0 | 9.3 | 40 | 240p |
| DiDeMo [anne2017localizing] | Flickr | 27K | 41K | 6.9 | 8.0 | 87 | - |
| LSMDC [Rohrbach2016MovieD] | movie | 118K | 118K | 4.8 | 7.0 | 158 | 1080p |
| YouCook II [Zhou2018TowardsAL] | cooking | 14K | 14K | 19.6 | 8.8 | 176 | - |
| ActivityNet Caption [Krishna2017actnetcaption] | action | 100K | 100K | 36.0 | 13.5 | 849 | - |
| WebVid-2M [bain2021frozen] | open | 2.5M | 2.5M | 18.0 | 12.0 | 13K | 360p |
| HowTo100M [miech2019howto100m] | instructional | 136M | 136M | 3.6 | 4.0 | 134.5K | 240p |
| HD-VILA-100M (Ours) | open | 100M | 100M | 13.4 | 32.5 | 371.5K | 720p |

Table 1: Statistics of HD-VILA-100M and its comparison with existing video-language datasets.

2 Related Work

Video Representation

Video representations are typically designed with 2D/3D CNNs [szegedy2015going, he2016deep, tran2015learning, carreira2017quo, xie2018rethinking] or Transformers [gberta_2021_ICML]. Pioneering works of VL pre-training [sun2019videobert, sun2019learning, luo2020univl, patrick2021supportset, zhu2020actbert] adopt pre-extracted video features (e.g., S3D [zhang2018s3d], I3D [carreira2017quo]) for video representation. In image-language pre-training, researchers find that end-to-end training decreases the domain gap of visual representation and improves the generalization for image-text tasks [huang2021seeing]. For video representation, however, it is too heavy to make the video-based encoder (e.g., S3D, I3D, ResNet [he2016deep], SlowFast [feichtenhofer2019slowfast]) trainable. Thus, some works [lei2021less, zellers2021merlot] utilize an image-based encoder (e.g., ResNet [he2016deep], ViT [dosovitskiy2020vit]) with a sparse sampling mechanism to make the visual encoder trainable. In this paper, we explore how to make a video encoder trainable in consideration of both spatial and temporal features.

Video-Language Pre-Training

Vision and language pre-training has attracted extensive attention in recent years. Aligned with the success of image-language pre-training [huang2021seeing, li2020oscar, chen2020uniter], video-language pre-training is showing more and more promising potential [sun2019videobert, luo2020univl, li2020hero, zhu2020actbert, tang2021decembert, lei2021less, patrick2021supportset]. Among them, some works concentrate on a specific type of downstream task, such as video-text retrieval [bain2021frozen, xu2021videoclip] and video question answering [zellers2021merlot]. In this paper, we explore pre-training a generalized model on diversified and large-scale data to adapt to different video-language tasks. Video-language pre-training tasks can be mainly categorized into two types: reconstructive and contrastive. Reconstructive methods [sun2019videobert, zhu2020actbert, li2020hero, tang2021decembert] usually adopt an early fusion architecture and aim to reconstruct a masked part in the visual or textual domain. Typical pre-training tasks are masked language modeling (MLM), masked frame modeling (MFM), and frame order modeling (FOM). Contrastive methods [sun2019learning, miech2020end, ging2020coot, zellers2021merlot] are inspired by contrastive learning and aim to learn video-text matching. In this paper, we combine these two types of objectives for the final target.

3 Dataset

To facilitate multimodal representation learning, we collect HD-VILA-100M, a large-scale, high-resolution, and diversified video-language dataset. In this section, we introduce how we collect the videos, how we select clips, and how we process their corresponding texts. We also give an overview of the statistics of the whole dataset. HD-VILA-100M will be released in the future.

3.1 Video Collection

We choose YouTube as the video source since it covers diverse categories of videos uploaded by different users, ranging from documentary films by professional TV channels to everyday vlogs by ordinary users. To cover more topics, we start from several official topics of YouTube videos. To ensure the high quality of videos as well as better alignment of video and transcription, we search on the YouTube website and a video analysis website to find popular YouTube channels, such as BBC Earth, National Geographic, etc. Videos in these channels, together with videos that appear in YouTube-8M [abu2016youtube] and YT-Temporal-180M [zellers2021merlot], make up a list of 14 million videos. We only keep videos with subtitles and 720p resolution. We then limit the total duration of each category to 30K hours to avoid a long tail. Finally, we obtain 3.1M high-quality videos in total, distributed in balance across 15 categories (as in Figure 2).

3.2 Video Clip Selection and Text Processing

To effectively generate video-text pairs, we use the transcriptions that come with the videos as the language in HD-VILA-100M. Different from traditional video-language datasets [xu2016msr, anne2017localizing] that use manual annotations of descriptive sentences for videos, transcriptions are available in large quantities and involve richer information than descriptions. However, many subtitles in YouTube videos are generated by ASR and are usually fragmentary and lacking punctuation. To overcome this problem, we utilize an off-the-shelf tool to split the subtitles into complete sentences. Then we make video clips by aligning the sentences to their corresponding clips via Dynamic Time Warping, using the timestamps of the original subtitles. After processing, each pair in HD-VILA-100M consists of a video clip of about 13.4 seconds on average and a sentence with 32.5 words on average.
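The alignment procedure is not spelled out beyond naming Dynamic Time Warping, so the following is only a generic illustration of the technique, not the pipeline used to build the dataset. It aligns two numeric sequences (for instance, hypothetical sentence start times and subtitle timestamps in seconds) with the classic DTW recurrence; the function name and the absolute-difference cost are assumptions made for this sketch.

```python
def dtw_path(a, b):
    """Classic dynamic time warping between two numeric sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching a[i] to b[j]. The local cost is the absolute difference, which
    is an illustrative choice.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Hypothetical example: two sentence start times vs. three subtitle timestamps.
total, path = dtw_path([0.0, 5.0], [0.0, 0.2, 5.1])
```

In this toy example the first two subtitle timestamps are matched to the first sentence and the third to the second sentence, which is the kind of many-to-one grouping needed when several subtitle fragments form one complete sentence.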

Figure 2: The distribution of categories in HD-VILA-100M dataset: (a) video, (b) video clip. [Best viewed in Color]

3.3 Data Statistics

The detailed data statistics of HD-VILA-100M are listed in Table 1. Compared with other video-language datasets, HD-VILA-100M is the largest video-language dataset in terms of duration and word number. More videos indicate richer visual information contained in HD-VILA-100M, and longer sentences mean that the language includes more detailed and richer semantics. Compared with HowTo100M [miech2019howto100m], which only includes instructional videos, HD-VILA-100M is derived from a wide range of domains, and the videos of each category are relatively balanced, as shown in Figure 2. This merit can improve the generalization power of the pre-trained model. Moreover, all the videos in HD-VILA-100M are in 720p, and the high quality ensures detailed information for video representation learning. In summary, HD-VILA-100M represents the largest, high-resolution, and diversified dataset for video and language learning.

4 Approach

Figure 3: The framework of HD-VILA. Yellow and green colors indicate HR- and LR-related input, operation and output, respectively. Hybrid Transformer learns spatiotemporal representation from HR and LR features. [Best viewed in Color]

Figure 3 shows the overall framework of the High-resolution and Diversified VIdeo-LAnguage (HD-VILA) model, which consists of three parts: (a) hybrid video encoder, (b) language encoder, and (c) multi-modal joint learning.

4.1 Hybrid Video Encoder

Since the video clips in our dataset are long-range, with 13.4 seconds on average, we adopt the strategy of sparsely sampling a sequence of segments from a video clip and then aggregating their predictions, similar to ClipBERT [lei2021less]. As explained in Section 1, for each segment we randomly take one HR frame at a sampled timestep, together with surrounding LR frames spaced according to an LR frame sampling rate, to build a hybrid image sequence.
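As a rough illustration of the sampling described above, the sketch below picks one HR frame index at random within a segment and takes LR neighbors at a fixed stride on both sides. The function name, the default of three neighbors per side (chosen to match the seven frames per segment reported in Section 5.1), the stride value, and the clamping at segment boundaries are all assumptions of this sketch, not details confirmed by the text.

```python
import random


def sample_hybrid_sequence(num_frames, num_neighbors=3, stride=2, rng=random):
    """Pick one HR frame index at random and LR neighbor indices around it.

    `num_neighbors` LR frames are taken on each side of the HR frame, spaced
    `stride` frames apart (the LR frame sampling rate). Indices that would
    fall outside the segment are clamped to its first or last frame, which is
    an illustrative boundary policy.
    Returns (hr_index, lr_indices).
    """
    if num_frames <= 0:
        raise ValueError("segment must contain at least one frame")
    hr_index = rng.randrange(num_frames)
    offsets = [k * stride for k in range(-num_neighbors, num_neighbors + 1) if k != 0]
    lr_indices = [min(max(hr_index + off, 0), num_frames - 1) for off in offsets]
    return hr_index, lr_indices


# One HR frame plus six LR neighbors gives a seven-frame hybrid sequence.
hr, lr = sample_hybrid_sequence(num_frames=100, rng=random.Random(0))
```

Randomizing the HR position while keeping the LR neighbors on a regular grid mirrors the stated motivation: robustness from random HR sampling and temporal context from uniformly spaced LR frames.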

In Figure 3, the hybrid video encoder includes three parts: an HR encoder for the HR frame, an LR encoder for the LR neighbor frames, and a Hybrid Transformer that learns spatiotemporal features by self-attention. The HR encoder consists of a 4-stage ResNet and an adapter. The LR encoder is a 3-stage ResNet that encodes the LR frames. Note that both ResNets are learnable, to ensure that HR and LR frames can be encoded in the same space before being fed into the Hybrid Transformer. We extract the hybrid spatiotemporal feature of a segment as the output of the Hybrid Transformer. Its LR input is the LR encoder feature of each LR frame, and its HR input is the HR frame feature extracted by stage 3 of the HR ResNet, passed through an interpolation operation to align the feature size. In the Hybrid Transformer, we adopt Divided Space-Time Attention to encode spatiotemporal information, similar to [gberta_2021_ICML]. We extract the detailed spatial feature of a segment as the output of the HR encoder, i.e., the 4-stage ResNet applied to the HR frame followed by the adapter.

To adapt the output of the HR encoder to the hybrid spatiotemporal feature, the adapter consists of a convolution layer that adjusts the number of output feature channels, as well as a max-pooling layer for down-sampling. The segment feature is the fusion of the detailed spatial feature and the hybrid spatiotemporal feature by a linear layer.

4.2 Language Encoder and Multi-Modality Joint Embedding Learning

For both the language encoder and multi-modality joint embedding learning, we use self-attention to model both uni-modal and multi-modal relationships with a Transformer. More specifically, we adopt a 24-layer, 1024-dimensional Transformer, mirroring BERT-large, and we initialize it with pre-trained BERT-large parameters. The first 12 layers are used as the language-only Transformer and the last 12 layers are used as the multi-modal Transformer. The language-only Transformer extracts the language representation, which is concatenated with the video features of a segment as the input of the multi-modal Transformer. We add learnable 1D and 2D position embeddings to language and vision tokens, respectively. Such a modal-independent design has two advantages. Firstly, it provides a powerful embedding for a single-modal input in downstream tasks. For example, the vision-aware language-only embedding could be used for language-guided video generation tasks. Secondly, the two-stream architecture improves the efficiency of computing the similarity between video and language to linear complexity in some specific downstream tasks, such as video-language retrieval.

(a) MSRVTT-QA test set.

| Method | Acc |
| --- | --- |
| ST-VQA [jang2017tgif] | 30.9 |
| Co-Memory [gao2018motion] | 32.0 |
| AMU [xu2017video] | 32.5 |
| Heterogeneous Mem [fan2019heterogeneous] | 33.0 |
| HCRN [le2020hierarchical] | 35.6 |
| ClipBERT [lei2021less] | 37.4 |
| Ours | 39.2 |

(b) MSRVTT multiple-choice test.

| Method | Acc |
| --- | --- |
| CT-SAN [yu2017end] | 66.4 |
| MLB [kim2016hadamard] | 76.1 |
| JSFusion [yu2018jsfusion] | 83.4 |
| ActBERT PT [zhu2020actbert] | 85.7 |
| ClipBERT [lei2021less] | 88.2 |
| VideoClip [xu2021videoclip] | 92.1 |
| Ours | 96.3 |

(c) TGIF-QA test set.

| Method | Action | Trans | Frame |
| --- | --- | --- | --- |
| ST-VQA [jang2017tgif] | 60.8 | 67.1 | 49.3 |
| Co-Memory [gao2018motion] | 68.2 | 74.3 | 51.5 |
| PSAC [li2019beyond] | 70.4 | 76.9 | 55.7 |
| HCRN [le2020hierarchical] | 75.0 | 81.4 | 55.9 |
| QueST [jiang2020divide] | 75.9 | 81.0 | 59.7 |
| ClipBERT [lei2021less] | 82.8 | 87.8 | 60.3 |
| Ours | 84.0 | 89.0 | 60.5 |

Table 2: Comparison of HD-VILA with state-of-the-art methods on video question answering tasks. (a) Results of ST-VQA and Co-Memory are implemented by [fan2019heterogeneous]. (b) Results of CT-SAN and MLB are implemented by [yu2018jsfusion].

4.3 Pre-Training Tasks

We adopt two pre-training tasks in HD-VILA: video-language matching to enhance cross-modal matching, and masked language modeling (MLM) to encourage the mapping between visual and language tokens at a fine-grained level. In particular, since the matching between video and language in transcription data is somewhat weak compared with that in video description datasets, we apply contrastive video-language matching to take advantage of large data.

Contrastive Video-Language Matching

To align the feature spaces of video and language, we use a contrastive loss to maximize the similarity of a matched video clip and sentence. Specifically, we treat matched pairs in a batch as positives, and all other pairwise combinations as negatives. The loss is computed over the normalized embeddings of every video and every sentence in the batch, with the similarities scaled by a temperature. Video and sentence features are computed by our hybrid video encoder and language encoder. The mean of the segment embeddings is used as the video-level embedding.
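The loss described above can be sketched as a standard in-batch contrastive (InfoNCE-style) objective. This is a minimal pure-Python illustration under stated assumptions: summing a video-to-text and a text-to-video term is a common choice but is assumed here rather than confirmed by the text, the default temperature of 0.07 is a placeholder, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _mean_nll_of_diagonal(sim):
    """Mean over rows i of -log softmax(sim[i])[i], computed stably."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_denominator - row[i]
    return total / len(sim)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """In-batch contrastive loss: pair i is positive, all other pairs negative.

    Sums the video-to-text and text-to-video directions (an assumption of
    this sketch). `temperature` is a placeholder value.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # sim[i][j] = scaled cosine similarity between video i and sentence j
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    return _mean_nll_of_diagonal(sim) + _mean_nll_of_diagonal(sim_transposed)


# Matched pairs give a lower loss than deliberately swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]], temperature=1.0)
swapped = contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]], temperature=1.0)
```

Because every other item in the batch serves as a negative, larger batches give more negatives per positive, which is consistent with computing the similarity on features gathered from all GPUs during pre-training.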

Masked Language Modeling

We adopt Masked Language Modeling (MLM) to better build the mapping between the visual and language domains. MLM aims to predict the ground-truth labels of masked text tokens from the contextualized tokens, i.e., from the remaining text embedding tokens and the visual tokens, with the loss taken in expectation over text-video pairs sampled from the training distribution. We adopt the same masking strategy as in BERT and use an MLP as the MLM head to output logits over the vocabulary, from which the negative log-likelihood loss for the masked token is computed. We aggregate the logits of different segments to derive a consensus, so that MLM can be calculated at the video level, as adopted in our approach.
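For reference, the BERT masking strategy mentioned above selects about 15% of the tokens; of the selected tokens, 80% are replaced with a mask token, 10% with a random vocabulary token, and 10% are left unchanged. The sketch below illustrates that rule on a plain token list. The function name and the toy vocabulary are hypothetical, and special tokens such as [CLS] and [SEP] are not treated separately here.

```python
import random

MASK_TOKEN = "[MASK]"


def bert_mask(tokens, vocab, mask_prob=0.15, rng=random):
    """Apply BERT-style masking to a list of tokens.

    Returns (corrupted_tokens, labels), where labels[i] holds the original
    token at positions selected for prediction and None elsewhere.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Hypothetical toy sentence and vocabulary.
toy_vocab = ["the", "dog", "runs", "on", "grass"]
corrupted, labels = bert_mask(["the", "dog", "runs"], toy_vocab, rng=random.Random(0))
```

Only positions with a non-None label contribute to the negative log-likelihood; in the video-level setting described above, the logits for those positions would first be aggregated across segments.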

5 Experiments

In this section, we conduct extensive experiments to evaluate the proposed HD-VILA pre-training model. We first introduce the implementation details of model training, then report the state-of-the-art performance on a range of downstream video-language understanding and visual generation tasks, followed by ablation studies to verify each design module.

5.1 Pre-training Details

Inspired by the idea of “align before fuse” [Li2021AlignBF], we adopt a two-stage fashion for pre-training on the HD-VILA-100M dataset. In the first stage, we perform the contrastive video-language matching task on the entire dataset to learn cross-modality alignment. In the second stage, pre-training is performed with MLM to facilitate understanding tasks. For the video encoder, we use ResNet-50 for the HR and LR encoders, and a four-layer Transformer with 16 heads and a hidden size of 1024 for the Hybrid Transformer. We empirically divide a video clip into two segments and sample seven frames for each. In this setting, the two segments can cover about 6s of video content, which is adequate to model the video clips in our dataset. Besides, we randomly crop areas for the middle high-resolution frames, and select aligned areas for the low-resolution neighboring frames. The size of resultant feature map before feeding into the multimodal Transformer is . For language processing, we follow BERT [Devlin2018] and adopt the WordPiece tokenizer to split a sentence into word tokens with a max length of 50.

In pre-training, we use the AdamW optimizer [loshchilov2017adamw] with an initial learning rate of 5e-5 and a fixed weight decay of 1e-3. We also employ a linear decay learning rate schedule with a warm-up strategy. We train our model using 64 NVIDIA Tesla V100 GPUs. The batch size is set to 512 and the contrastive similarity is calculated on gathered features from all GPUs. The model is trained for five epochs in stage one and five epochs in stage two, until the evaluation metrics on our validation sets saturate. In downstream tasks, we keep the same model configuration unless otherwise specified.
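A linear decay schedule with warm-up can be sketched as follows. The peak learning rate of 5e-5 comes from the text; the number of warm-up steps is not given, so `warmup_steps` is a free parameter here, and starting the warm-up from zero and decaying to zero are assumptions of this sketch.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=5e-5):
    """Learning rate at `step` for linear warm-up followed by linear decay.

    The rate rises linearly from 0 to `base_lr` over `warmup_steps`, then
    falls linearly back to 0 at `total_steps`.
    """
    if total_steps <= 0 or not 0 <= warmup_steps < total_steps:
        raise ValueError("need 0 <= warmup_steps < total_steps")
    step = min(max(step, 0), total_steps)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)


# Hypothetical run of 1000 steps with 100 warm-up steps.
schedule = [linear_warmup_decay(s, total_steps=1000, warmup_steps=100) for s in (0, 100, 550, 1000)]
```

With these hypothetical settings the rate peaks at step 100 and is halved midway through the decay phase, at step 550.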

| Method | R@1 | R@5 | R@10 | MedR |
| --- | --- | --- | --- | --- |
| HowTo100M [miech2019howto100m] | 14.9 | 40.2 | 52.8 | 9.0 |
| CE [liu2019use] | 20.9 | 48.8 | 62.4 | 6.0 |
| DECEMBERT [tang2021decembert] | 17.5 | 44.3 | 58.6 | 9.0 |
| HERO [li2020hero] | 16.8 | 43.4 | 57.7 | - |
| ClipBERT [lei2021less] | 22.0 | 46.8 | 59.9 | 6.0 |
| VLM [xu2021vlm] | 28.1 | 55.5 | 67.4 | 4.0 |
| MMT [gabeur2020multi] | 26.6 | 57.1 | 69.6 | 4.0 |
| Support Set [patrick2021supportset] | 30.1 | 58.5 | 69.3 | 3.0 |
| VideoCLIP [xu2021videoclip] | 32.2 | 62.6 | 75.0 | - |
| Ours | 35.0 | 65.2 | 77.2 | 3.0 |

Zero-shot setting:

| Method | R@1 | R@5 | R@10 | MedR |
| --- | --- | --- | --- | --- |
| HT MIL-NCE [miech2020end] | 9.9 | 24.0 | 32.4 | 29.5 |
| Support Set [patrick2021supportset] | 8.7 | 23.0 | 31.1 | 31.0 |
| VideoCLIP [xu2021videoclip] | 10.4 | 22.2 | 30.0 | - |
| Ours | 14.4 | 31.6 | 41.6 | 17.5 |

Table 3: Comparison of text-to-video retrieval in MSR-VTT [xu2016msr]. We gray out some lines to highlight fair comparisons with traditional retrieval models and general pre-training models. This mark is also applicable to Table 5, 6.

5.2 Video Question Answering


(a) MSRVTT-QA [xu2017video] is created based on video and captions in MSR-VTT, containing 10K videos and 243K open-ended questions. Given a question in a complete sentence, the model selects an answer from a pre-defined set. (b) MSRVTT multiple-choice test [yu2018jsfusion] is a multiple-choice task with videos as queries, and captions as answers. Each video contains five candidate captions, with only one positive match. The benchmark has 2,990 questions for the multiple-choice test. (c) TGIF-QA [jang2017tgif] contains 165K QA pairs on 72K GIF videos. We experiment with three TGIF-QA tasks: Action is defined as a multiple-choice task to identify an action that has been repeated in a video. Transition aims to identify the state before or after another state. FrameQA is about open-ended questions about the given video. The task objective is identical to MSRVTT-QA.

Implementation Details

For TGIF Action and Transition, we concatenate each of the five candidate answers with the question, forming five sequences. On top of the [CLS] token of the question, we train a two-layer MLP to predict the confidence of the five candidates with a cross-entropy loss. For MSRVTT-QA and TGIF Frame, we encode the answers in a one-hot fashion, and train a two-layer MLP classifier over all answer candidates with a cross-entropy loss on top of the [CLS] token of the question. For MSRVTT multiple-choice, we directly choose the answer with the highest similarity. We use one segment containing seven frames for TGIF and MSR-VTT. For MSR-VTT, we resize the HR frame of each segment to 720p and the LR frames to 180p. Due to the varying resolutions of videos in TGIF, we set the max edge of LR frames to 192 and of the HR frame to 768, then pad all frames to squares. We set the max epoch to 80 for TGIF and 20 for MSRVTT-QA, with a 5e-5 learning rate and a multi-step decay. We report the results of the model with the best performance on the validation set. We set the max batch size to fine-tune on 8 V100 32G GPUs.


The results of HD-VILA on video QA are shown in Table 2. Our model outperforms existing methods on all five tasks across the three datasets, with absolute improvements of 1.2, 1.2, and 0.2 on the Action, Trans, and Frame tasks of the TGIF-QA dataset. The gain on Frame is limited because Frame focuses on a single frame, which reduces the advantage of our hybrid image sequence. On the MSRVTT-QA and MSRVTT multiple-choice tests, we achieve 4.8% and 4.6% relative improvements over SOTA methods. Among all the compared methods, ClipBERT [lei2021less] and ActBERT [zhu2020actbert] are two pre-training models. We can see that pre-training with more data marginally improves the performance. ClipBERT is pre-trained on an image-language dataset, whereas videos provide richer information. Note that the language used in ClipBERT for pre-training is closer to the downstream datasets in both content and length, while the language in HD-VILA-100M has a domain gap with the TGIF and MSR-VTT languages. This further indicates the generalization of the video representation learned by our HD-VILA.
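The improvements quoted above follow directly from the numbers in Table 2. The short check below recomputes them; the helper name is hypothetical, and the baselines are the strongest prior entries in each sub-table (ClipBERT for MSRVTT-QA and TGIF-QA, VideoClip for the multiple-choice test).

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Relative gains on the two MSR-VTT QA benchmarks (Table 2a and 2b).
msrvtt_qa = relative_gain(39.2, 37.4)   # vs. ClipBERT
msrvtt_mc = relative_gain(96.3, 92.1)   # vs. VideoClip

# Absolute gains on TGIF-QA Action, Trans, and Frame (Table 2c), vs. ClipBERT.
tgif_abs = [round(ours - base, 1) for ours, base in [(84.0, 82.8), (89.0, 87.8), (60.5, 60.3)]]

print(round(msrvtt_qa, 1), round(msrvtt_mc, 1), tgif_abs)
```

Rounded to one decimal place, these reproduce the reported 4.8% and 4.6% relative gains and the 1.2, 1.2, and 0.2 absolute gains.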

| Method | R@1 | R@5 | R@10 | MedR |
| --- | --- | --- | --- | --- |
| HERO [li2020hero] | 2.1 | - | 11.4 | - |
| S2VT [venugopalan2014translating] | 11.9 | 33.6 | - | 13.0 |
| FSE [zhang2018cross] | 13.9 | 36.0 | - | 11.0 |
| CE [liu2019use] | 16.1 | 41.1 | - | 8.3 |
| ClipBERT [lei2021less] | 20.4 | 48.0 | 60.8 | 6.0 |
| Ours | 26.0 | 54.8 | 69.0 | 4.0 |

Table 4: Comparison of text-to-video retrieval on DiDeMo [Hendricks2017didemo].

| Method | R@1 | R@5 | R@10 | MedR |
| --- | --- | --- | --- | --- |
| JSFusion [yu2018jsfusion] | 9.1 | 21.2 | 34.1 | 36.0 |
| MEE [miech2018learning] | 9.3 | 25.1 | 33.4 | 27.0 |
| CE [liu2019use] | 11.2 | 26.9 | 34.8 | 25.3 |
| MMT [gabeur2020multi] | 12.9 | 29.9 | 40.1 | 19.3 |
| Ours | 17.2 | 32.9 | 43.0 | 16.0 |

Table 5: Comparison of text-to-video retrieval on LSMDC [Rohrbach2016MovieD].

| Method | R@1 | R@5 | R@10 | MedR |
| --- | --- | --- | --- | --- |
| FSE [zhang2018cross] | 18.2 | 44.8 | 89.1 | 7.0 |
| CE [liu2019use] | 18.2 | 47.7 | 91.4 | 6.0 |
| HSE [zhang2018cross] | 20.5 | 49.3 | - | - |
| ClipBERT [lei2021less] | 21.3 | 49.0 | - | 6.0 |
| MMT [gabeur2020multi] | 28.7 | 61.4 | 94.5 | 3.3 |
| Support Set [patrick2021supportset] | 29.2 | 61.6 | 94.7 | 3.0 |
| Ours | 27.4 | 56.0 | 93.1 | 4.0 |

Table 6: Comparison of text-to-video retrieval on ActivityNet [Krishna2017actnetcaption].

5.3 Video-Text Retrieval


We conduct video-text retrieval experiments on four datasets. (a) MSR-VTT [xu2016msr] contains 10K YouTube videos with 200K descriptions. We follow previous works [yu2018jsfusion, liu2019use], training models on 9K videos, and reporting results on the 1K-A test set. (b) DiDeMo [anne2017localizing] consists of 10K Flickr videos annotated with 40K sentences. We follow [liu2019use, zhang2018cross] to evaluate paragraph-to-video retrieval, where all descriptions for a video are concatenated to form a single query. (c) LSMDC [Rohrbach2016MovieD] consists of 118,081 video clips sourced from 202 movies. Each video has a caption. The validation set contains 7,408 clips and evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and val sets. (d) ActivityNet Captions [Krishna2017actnetcaption] contains 20K YouTube videos annotated with 100K sentences. We follow the paragraph-to-video retrieval protocols [zhang2018cross, liu2019use] training on 10K videos and reporting results on the val1 set with 4.9K videos.

Figure 4: Text-guided manipulation compared with StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan]. Our model is able to handle complex descriptions and edit the inputs according to the target attributes (highlighted in red) better. All the inputs are of size.

Implementation Details

Due to the varying resolutions of videos in downstream datasets, we resize the HR frame of each segment to 720p and the LR frames to 180p. We adopt the stage-one model and the same training methods and objective for fine-tuning. We adjust the number of sampled segments and frames according to the average duration of videos in each dataset to cover about half of the video. For evaluation, we double the number of segments. We train for at most 100 epochs with a 5e-6 learning rate and a multi-step decay. We report the results of the model with the best performance on the validation set. If there is no validation set, we report the results of the last model. To conduct zero-shot evaluation on low-resolution MSR-VTT videos, we crop a patch for each frame and up-sample the middle frames by times. We report the result of the last saved model.

Figure 5: Text-guided super-resolution compared with pSp [richardson2021psp] and SR3 [saharia2021sr3]. Our model is able to reconstruct more accurate target attributes with descriptions (e.g., eyeglasses in the third case). All inputs are upsampled from to .


Tables 3, 4, 5, and 6 show the text-to-video retrieval results of HD-VILA on four datasets. For MSR-VTT, we outperform the previous works that are pre-trained on HowTo100M [miech2019howto100m] by large margins. Compared with VideoCLIP [xu2021videoclip], we obtain a 38.5% relative gain in R@1 in the zero-shot setting, which shows the generalization ability of our learned video feature. Our fine-tuned model outperforms all the baseline models. On LSMDC, we obtain an even larger relative gain of 53.6% under fair comparison. Movie videos are significantly different from how-to style videos, so models pre-trained on HowTo100M are difficult to adapt to the movie domain. Additional improvement comes from our high-resolution pre-training, since LSMDC also contains high-resolution videos. On DiDeMo and ActivityNet, the pre-trained HD-VILA model also achieves better performance. The videos in these two datasets are diversified in both scale and category, and are much longer. In this setting, models usually need more capacity for temporal understanding. Since the videos in HD-VILA-100M are longer and the transcripts have more detailed information, our pre-trained HD-VILA model performs better on long video retrieval. Note that there are also pre-training models that are specifically designed for the video-text retrieval task by improving noise contrastive learning, like SupportSet [patrick2021supportset], or that use more features other than vision and motion, like MMT [gabeur2020multi]. To make a fair comparison, we gray them out in the tables and list their results only for reference. Even so, we outperform these approaches on MSR-VTT and LSMDC.
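The retrieval tables report Recall@K (R@1, R@5, R@10) and median rank (MedR). As a reminder of how these metrics are typically computed from a text-to-video similarity matrix, here is a minimal sketch. It assumes query i has exactly one correct video, video i, and ranks ties optimistically; the function name and the toy similarity values are hypothetical.

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K (in percent) and median rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j; the correct
    video for query i is assumed to be video i. Ties are ranked
    optimistically (only strictly higher scores push the rank down).
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    recalls = {k: 100.0 * sum(1 for r in ranks if r <= k) / n for k in ks}
    ordered = sorted(ranks)
    mid = n // 2
    median_rank = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return recalls, median_rank


# Toy 3x3 similarity matrix: query 0 ranks its video first, queries 1 and 2 second.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
recalls, median_rank = retrieval_metrics(toy_sim, ks=(1, 5))
```

Higher Recall@K and lower MedR are better, which is why the tables read in opposite directions for the two kinds of columns.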

5.4 Text-to-Visual Generation

Recent studies such as StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan] leverage cross-modal pre-training to facilitate language-guided generation tasks and have obtained promising results. As shown in their work, the quality of visual generation results can reflect the quality of the cross-modality embedding. Hence, in this section, we describe how our pre-trained model can be applied to this task, and we verify our learned embedding by showing higher-quality visual results than SOTA models.


To conduct this research, we introduce the first Face-Description-Video Dataset (FDVD). The dataset consists of 613 high-resolution () videos, resulting in 74,803 frames of human faces. The videos are collected from the Ryerson audio-visual dataset [livingstone2018ryerson] and post-processed following Karras et al. [karras2018progressive]. We generate ten different text descriptions for each video following previous works [xia2021tedigan]. To increase the diversity of human faces, we also leverage Multi-modal CelebA-HQ [xia2021tedigan, karras2018progressive] for training.

Implementation Details

We follow previous works [xia2021tedigan, patashnik2021styleclip] and use a well pre-trained StyleGAN [karras2019stylegan] as our generator, due to its superior performance. In practice, we learn several linear layers that map the vision and text embeddings of HD-VILA to the latent codes used in StyleGAN. Images are then generated from these latent codes. To ensure that the generated results have high visual quality, preserve identity, and match the descriptions, we optimize a mean-squared-error loss, an LPIPS loss [zhang2018unreasonable], an identity loss [richardson2021psp], and a text-vision matching loss [saharia2021sr3], following previous works [patashnik2021styleclip, xia2021tedigan, richardson2021psp].
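One common way to combine the four losses named above is a weighted sum, sketched below in LaTeX. The text does not state how the terms are combined or weighted, so both the additive form and the weights $\lambda_{\ast}$ are assumptions made here for illustration.

```latex
\mathcal{L}
  = \lambda_{\mathrm{mse}}   \, \mathcal{L}_{\mathrm{mse}}
  + \lambda_{\mathrm{lpips}} \, \mathcal{L}_{\mathrm{lpips}}
  + \lambda_{\mathrm{id}}    \, \mathcal{L}_{\mathrm{id}}
  + \lambda_{\mathrm{match}} \, \mathcal{L}_{\mathrm{match}}
```

Here $\mathcal{L}_{\mathrm{mse}}$ and $\mathcal{L}_{\mathrm{lpips}}$ would target visual quality, $\mathcal{L}_{\mathrm{id}}$ identity preservation, and $\mathcal{L}_{\mathrm{match}}$ agreement between the generated image and the text description.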

Text-to-Visual Manipulation

We compare our model with the recent state-of-the-art text-guided manipulation models StyleCLIP [patashnik2021styleclip] and TediGAN [xia2021tedigan] in Figure 4. The results show that our model is able to edit the target attributes of the inputs according to the text descriptions. For example, in the first case in Figure 4, our model makes the hair wavy and also adds lipstick to the lips, whereas StyleCLIP and TediGAN fail to add the lipstick.

Type    Size    R@1    R@5    R@10    MedR
HowTo   720p    3.3    8.2    13.5    113.0
Ours    360p    3.9    11.0   18.3    67.0
Ours    720p    5.5    13.1   20.5    58.0
Table 7: Ablation study on pre-training data. We report results of zero-shot MSR-VTT retrieval.

#HR    #LR    R@1     R@5     R@10    MedR
1      0      14.3    34.6    49.3    11.0
0      10     25.5    54.6    67.8    5.0
1      6      31.2    59.6    71.0    4.0
1      10     31.8    60.1    71.8    3.0
1      14     30.1    58.4    70.4    3.0
Table 8: Ablation study on frame selection (trained on 8 GPUs). We report results of MSR-VTT retrieval, where #HR/#LR are the numbers of high/low-resolution frames.

Text-to-Visual Super-Resolution

We further compare our model with the SOTA super-resolution methods SR3 [saharia2021sr3] and pSp [richardson2021psp]. We generate images from their LR counterparts. Note that this task is extremely challenging because the inputs have such low resolution. As shown in the second case of Figure 5, SR3 [saharia2021sr3] and pSp [richardson2021psp] cannot reconstruct high-quality faces using visual information alone. In contrast, our model accurately reconstructs the lipstick and the straight hair with the help of the text description, thanks to the pre-trained model.

5.5 Ablation Studies

In this section, we conduct ablation studies to further verify the effectiveness of the new HD-VILA-100M dataset and the proposed hybrid video encoder.

(1) Diversity of HD-VILA-100M. We sample two video subsets from HD-VILA-100M, each with two million clip-text pairs. One subset includes only the “HowTo” type, while the other consists of diversified and balanced categories sampled from the full dataset. As shown in Table 7, compared with the “HowTo” subset with limited semantics, our diversified pre-training subset (indicated as “Ours-720p”) achieves higher performance in the MSR-VTT retrieval task, with a 66.7% relative gain in R@1. We choose the MSR-VTT zero-shot retrieval task for this ablation study because it is the most widely used evaluation task in video-language pre-training.

(2) High resolution of HD-VILA-100M. We downsample the “Ours-720p” subset to a lower resolution (“Ours-360p”) and observe a significant drop, with a 29.1% relative decrease in R@1. These evaluations demonstrate the benefit of both the diversified categories and the higher resolution of the proposed dataset.

(3) Numbers of high/low-resolution frames. Because the number of high/low-resolution frames used for video modeling often plays a key role in video pre-training, we vary the frame numbers and fine-tune the pre-trained model in different settings. As shown in Table 8, adding high-resolution frames leads to significant increases over the setting that uses only low-resolution inputs. In particular, the 1-HR & 10-LR setting achieves the best performance, outperforming both 0-HR & 10-LR and 1-HR & 0-LR (“0” indicates that one branch is removed), which demonstrates the rationality of jointly modeling spatial and temporal features in our approach.
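The relative changes quoted for Table 7 can be reproduced from the R@1 column with a few lines of Python. This is only a worked check of the arithmetic; the helper name `relative_change` and the dictionary keys are chosen here for illustration.

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# R@1 values copied from Table 7 (zero-shot MSR-VTT retrieval).
r1 = {"HowTo-720p": 3.3, "Ours-360p": 3.9, "Ours-720p": 5.5}

# Diversity: diversified 720p subset versus the HowTo-only 720p subset.
diversity_gain = relative_change(r1["Ours-720p"], r1["HowTo-720p"])

# Resolution: 360p subset versus the 720p subset it was downsampled from.
resolution_drop = relative_change(r1["Ours-360p"], r1["Ours-720p"])

print(f"{diversity_gain:.1f}")   # 66.7
print(f"{resolution_drop:.1f}")  # -29.1
```

Both figures are relative to the respective baseline, not absolute differences in R@1: the absolute gaps are 2.2 and 1.6 points.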

6 Conclusion

In this paper, we propose to learn high-resolution and diversified video-language multi-modal representations by pre-training on large-scale video-language pairs. To empower pre-training, we introduce a new dataset, HD-VILA-100M, which is the largest high-resolution and diversified video-language dataset. To make more efficient use of the richer information in videos, we propose a novel pre-training model, HD-VILA, that learns spatiotemporal information from HR and LR frames, treated as a hybrid image sequence, with a hybrid Transformer. We carefully select 12 video-language understanding and text-to-visual generation tasks to show the capability of the HD-VILA-100M dataset and the effectiveness of our pre-trained model.

Limitation and Social Impact

The proposed video-language dataset and pre-training model show the capacity and generalization of the learned VL representation, which could benefit many applications in CV and NLP. This result also requires substantial computational resources, so reducing the model size and computing effort becomes more essential for future research. In addition, the use of user-generated data might bring the risk of bias. We mitigate this problem by balancing the video categories, yet the videos might still contain biased content. Moreover, preventing malicious use of visual generation techniques for deliberate attacks is also critical. However, these concerns are general to the entire field and are not amplified by this work.