
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, the pre-extracted video features in previous studies cannot be refined through MVM during pre-training, thus leading to unsatisfactory downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), which mitigates the disconnection between fixed video representations and MVM training. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens and latent visual features. We conduct comprehensive experiments and provide insights on the factors leading to effective MVM training. Empirically, we show VIOLET pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.


1 Introduction

Video, which naturally contains multiple modalities, has long served as a testbed for how AI systems perceive. Video-language (VidL) research aims at extending this ability to convey perception via language. Popular VidL tasks have been introduced, such as text-to-video retrieval [80, 29, 59], video question answering [25, 79], and video captioning [80, 6]. Recent progress in VidL learning mostly focuses on VidL pre-training [63, 52, 88] with video-text matching [39, 84] and masked language modeling [11]. There have also been attempts at similar masked modeling on vision inputs. For example, masked frame modeling [39] aims to recover masked frame representations. However, the pre-extracted video features cannot be refined during pre-training, which may limit its effectiveness.

Figure 1: We systematically explore eight masked visual modeling (MVM) targets for end-to-end video-language (VidL) pre-training, including RGB pixel values (Pixel), histogram of oriented gradients (HOG), depth maps (Depth), optical flow (Flow), discrete visual tokens (VQ), spatial-focused image features (SIF), temporal-aware video features (TVF), and multimodal features from CLIP (MMF). Besides MVM, the proposed VIOLET model is pre-trained along with video-text matching (VTM) and masked language modeling (MLM).

Meanwhile, self-supervised vision pre-training has been proven highly effective by reconstructing the masked image patches through raw pixel values [21, 78], discrete visual tokens [3, 86], or visual-semantic features [75, 76]. However, they all only focus on the visual modality. It is unknown how masked visual modeling (MVM) objectives can help VidL learning, especially given that the paired language inputs can already provide high-level semantics.

Motivated by this, we conduct a comprehensive study of MVM for VidL learning. As shown in Figure 1, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (named VIOLET [16]), and study a broad spectrum of MVM targets, including RGB pixel values (Pixel), histogram of oriented gradients (HOG), depth maps (Depth), optical flow (Flow), discrete visual tokens (VQ), spatial-focused image features (SIF), temporal-aware video features (TVF), and multimodal features (MMF). During pre-training, we mask out some proportions of the video input along both spatial and temporal dimensions, and the model learns to recover the MVM targets for these masked patches. Together with two other standard pre-training tasks (i.e., video-text matching (VTM) and masked language modeling (MLM)), we empirically verify the effectiveness of different MVM variants on downstream VidL tasks.

Our study reveals that: (1) spatial-focused image features (SIF) are the most effective MVM target on video-text inputs; and (2) the effects of different MVM targets on downstream VidL tasks are not shared between video-text and image-text inputs. For example, SIF extracted from the same model brings a large drop in downstream VidL performance when pre-trained with image-text pairs. In addition, we conduct comprehensive analyses of the masking strategy and ratio, and of the combination of different MVM targets, to shed light on the factors behind effective MVM training for VidL learning.

In summary, our contributions are three-fold. (1) We present an empirical study of masked visual modeling for video-language pre-training; (2) We conduct comprehensive analyses with extensive experimental results to shed light on effective MVM training; and (3) VIOLET pre-trained with the MVM objective achieves strong performance on 13 VidL datasets over 3 popular tasks, covering video question answering, video captioning and text-to-video retrieval. Concretely, compared to models trained on the same 5M pre-training dataset, VIOLET with effective MVM pre-training brings notable mean improvements of +5.4% accuracy on video question answering, +6.6% recall on text-to-video retrieval, and +11.4 CIDEr on video captioning.

2 Related Work

Video-Language Understanding.

Joint video-language (VidL) understanding [41, 44, 26, 32, 17, 54] aims at interpreting the physical world via both vision and text perception. Researchers have explored such capability on VidL tasks including text-to-video retrieval [80, 29, 59, 37, 39], video question answering [25, 79, 35, 36], moment retrieval [24, 19, 29, 37], and video captioning [74, 87, 80, 59]. Prior arts before the large-scale pre-training era [18, 85, 33, 14, 32, 36] leverage offline extracted video features [27, 72, 5, 77, 15, 10, 22, 30, 1]. Later on, VidL pre-trained models [63, 88, 39, 52] built on the above pre-extracted features have shown promising results. To enhance the performance, there have been parallel interests in bringing in more modalities from raw video inputs [17, 60, 43] and end-to-end training [51, 34, 84, 2], aiming to elevate video representations for VidL modeling.

Masked Visual Modeling (MVM).

Aligned with the success of transformer-based [70] language pre-training [31, 45], image-text pre-training [8, 64] and video-text pre-training [28, 82, 81] have shown promising results on diverse vision-language (VL) tasks. Popular VL pre-training tasks include visual-text matching (VTM) and masked language modeling (MLM), which are directly adapted from language pre-training [11]. Similar masked modeling on visual inputs [8, 39, 13] has also been introduced to VL pre-training, but has not proven as useful. Among the literature on vision pre-training itself, MAE [21, 67] and SimMIM [78] reconstruct the pixels of the masked image patches to enhance visual representation. BEiT [3], iBOT [86], VIMPAC [65], and BEVT [73] adopt a BERT-like pre-training strategy to recover the missing visual tokens. On the other hand, MaskFeat [75] and MVP [76] consider latent features for MVM, including hand-crafted HOG features and image features extracted from pre-trained CLIP models [55]. Unlike previous works exploring MVM on uni-modal data, in this study, we conduct a comprehensive investigation on how different MVM targets can help VidL learning.

3 Method

We first describe the problem formulation in Section 3.1, and then detail the overall framework of the proposed VIOLET model in Section 3.2. Finally, we discuss eight different target features considered for masked visual modeling (MVM) in Section 3.3.

3.1 Problem Setting

Given a large-scale video-language (VidL) dataset D, we aim to pre-train a VidL transformer to learn effective video-text representations. The learned representations can then be transferred to downstream tasks for performance improvement. Different from existing works that focus on MVM for pure vision problems [3, 21, 86], we study MVM as a VidL pre-training task. Consider a video-text pair (V, T), where V is a sequence of video frames and T is a sequence of word tokens. As shown in Figure 1, we randomly mask out some portions of the input frames V, and learn to predict the target features corresponding to the masked patches. To output a correct prediction, the model has to resort to other relevant video frames and/or the text tokens T. This facilitates cross-modality learning for better VidL understanding.

In addition, we employ the commonly used VidL pre-training objectives, including video-text matching (VTM) and masked language modeling (MLM), where VTM aims to predict whether an input video-text pair is matched or not, while MLM aims to predict the masked word tokens from the surrounding context (refer to the Appendix for the detailed formulation of VTM and MLM). Our overall pre-training objective can be written as:

$\mathcal{L} = \mathcal{L}_{\mathrm{MVM}} + \mathcal{L}_{\mathrm{VTM}} + \mathcal{L}_{\mathrm{MLM}},$    (1)

where $\mathcal{L}_{\mathrm{MVM}}$, $\mathcal{L}_{\mathrm{VTM}}$, and $\mathcal{L}_{\mathrm{MLM}}$ are the MVM, VTM and MLM objectives, respectively.
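As a minimal sketch (not the released training code), the three objectives in Eq. (1) can be assumed to be summed with equal weights and optimized jointly in the same step:

```python
import torch

# Sketch of one pre-training step, assuming equal weighting of the three losses.
def pretraining_step(loss_mvm: torch.Tensor,
                     loss_vtm: torch.Tensor,
                     loss_mlm: torch.Tensor,
                     optimizer: torch.optim.Optimizer) -> float:
    loss = loss_mvm + loss_vtm + loss_mlm   # L = L_MVM + L_VTM + L_MLM as in Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```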

3.2 End-to-End Video-Language Transformer

We conduct our empirical study using an end-to-end VIdeO-LanguagE Transformer (VIOLET [16]). As shown in Figure 1, VIOLET contains 3 components: Video Swin Transformer (VT), Language Embedder (LE), and Cross-modal Transformer (CT). VIOLET takes a video V and a sentence T as inputs. Sparsely sampled frames from V are first segmented into a set of video patches, and then processed by VT to compute video features v. LE extracts the word embeddings w for each word token in T. Then, CT performs cross-modal fusion on top of v and w to produce joint VidL representations h = {h_v, h_[CLS], h_w}, where h_v, h_[CLS], and h_w denote the hidden representations of the video patches, the special [CLS] token, and the other word tokens, respectively. We pre-train VIOLET in an end-to-end manner with all three objectives in Equation 1 on top of the outputs from CT.
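The following is a minimal sketch of this three-component pipeline; the sub-modules are stand-ins (a linear patch projector, an embedding table, and a small transformer encoder), not the actual Video Swin / BERT weights, and the shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class VideoLanguageTransformerSketch(nn.Module):
    """Sketch of a VIOLET-style pipeline: video encoder (VT), language embedder (LE),
    and cross-modal transformer (CT). All sub-modules are simplified stand-ins."""
    def __init__(self, hidden=768, vocab=30522):
        super().__init__()
        self.vt = nn.Linear(32 * 32 * 3, hidden)          # stand-in for Video Swin Transformer
        self.le = nn.Embedding(vocab, hidden)             # stand-in for the BERT word embedder
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.ct = nn.TransformerEncoder(layer, num_layers=2)   # stand-in for CT

    def forward(self, video_patches, text_ids):
        v = self.vt(video_patches)                        # (B, Nv, hidden) video features
        w = self.le(text_ids)                             # (B, Nt, hidden), text_ids[:, 0] is [CLS]
        h = self.ct(torch.cat([v, w], dim=1))             # joint VidL representations
        h_v, h_txt = h[:, :v.size(1)], h[:, v.size(1):]
        return h_v, h_txt[:, 0], h_txt[:, 1:]             # h_v, h_[CLS], h_w
```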

3.3 Target Features

Masked visual modeling (MVM) is a generic masked feature prediction task, where we mask out some of the visual input patches, and then predict the target features corresponding to the masked ones. Thus, a core design choice of MVM is the target features, which enable VIOLET to learn a desired aspect of visual modeling. While MVM has been explored in pure vision tasks [3, 21, 75], it remains an open question whether MVM can facilitate the interactions between video and language modalities. In this study, we investigate which designs of MVM are effective in the context of video-language pre-training.

Following [78, 75], we employ a simple linear layer or a 2-layer MLP as the prediction head for MVM, to project the hidden video representations h_v (of hidden size 768) from CT to the same dimension as the MVM targets. The default MVM loss is the L1 regression loss, unless specified otherwise. Next, we introduce the considered target features in detail.
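Below is a sketch of the 2-layer-MLP head and the masked L1 regression described above; the hidden size of the intermediate layer and the target dimension are assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class MVMHead(nn.Module):
    """2-layer MLP projecting 768-d cross-modal outputs to the MVM target dimension."""
    def __init__(self, hidden=768, target_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, target_dim))

    def forward(self, h_v):               # h_v: (B, Nv, hidden)
        return self.mlp(h_v)

def mvm_loss(pred, target, mask):         # mask: (B, Nv) bool, True = masked patch
    # L1 regression computed only on the masked patches.
    return (pred[mask] - target[mask]).abs().mean()
```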

RGB Pixel Values (Pixel).

We treat the normalized RGB pixel values as one of the candidate target features. During MVM, VIOLET learns to reconstruct the pixel values of the masked patches. The linear MVM head projects h_v into the same dimension as the raw video frame patch (32x32x3).

Histogram of Oriented Gradients (HOG).

HOG [9] is a classic feature descriptor that captures the distribution of gradient orientations in an image. While HOG has been proven effective for visual pre-training [75], it is unknown whether it can benefit VidL pre-training. We extract HOG features at a dense grid level, and use these feature descriptors as the prediction targets of MVM. The HOG feature map is of the same size as the input video frame, but with channel size 1. The linear MVM prediction head projects h_v to the same dimension as the HOG features of the video frame patch (32x32x1).
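One plausible way to obtain dense, per-patch HOG targets is sketched below with scikit-image; the cell size and number of orientations are assumptions (the paper does not specify them), and since the paper describes a frame-sized single-channel HOG map, this per-cell grouping is only one possible realization.

```python
import numpy as np
from skimage.feature import hog

def hog_targets(frame_gray: np.ndarray, patch=32, cell=8, orientations=9):
    """Dense HOG descriptors grouped into one target vector per 32x32 video patch."""
    feat = hog(frame_gray, orientations=orientations,
               pixels_per_cell=(cell, cell), cells_per_block=(1, 1),
               feature_vector=False)                 # (H/cell, W/cell, 1, 1, orientations)
    h, w = feat.shape[:2]
    feat = feat.reshape(h, w, orientations)
    k = patch // cell                                # HOG cells per patch side
    targets = feat.reshape(h // k, k, w // k, k, orientations)
    targets = targets.transpose(0, 2, 1, 3, 4).reshape(h // k, w // k, -1)
    return targets                                   # e.g., (7, 7, k*k*orientations) for a 224x224 frame
```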

Depth Maps (Depth).

Since depth maps usually contain fine-grained details of object shapes and the general scene layout of the foreground objects, it is worth exploring whether depth maps can be used to improve the scene/object understanding capability of a VidL pre-trained model. To obtain this MVM target, we employ a pre-trained dense prediction transformer (DPT) [57] to perform monocular depth estimation given an input video frame. The linear prediction head used for Depth is the same as the one for HOG, as both targets are of channel size 1.

Pre-training Tasks MVM Target TGIF-Frame DiDeMo-Retrieval
 Acc. R1 R5 R10 AveR
VTM+MLM None 68.1 28.7 57.0 69.7 51.8
+MVM RGB Pixel Values 68.3 (+0.2) 29.2 (+0.5) 58.6 (+1.6) 70.1 (+0.4) 52.6 (+0.8)
 Histogram of Oriented Gradients [9] 67.3 (-0.8) 26.6 (-2.1) 54.9 (-2.1) 68.1 (-1.6) 49.8 (-2.0)
 Depth Maps (DPT-L [57]) 68.0 (-0.1) 27.3 (-1.4) 55.0 (-2.0) 68.3 (-1.4) 50.2 (-1.6)
 Optical Flow (RAFT-L [66]) 67.6 (-0.5) 30.3 (+1.6) 58.0 (+1.0) 70.3 (+0.3) 52.9 (+1.1)
 Spatial-focused Image Features (Swin-B [46]) 68.8 (+0.7) 35.4 (+6.7) 62.4 (+5.2) 74.9 (+6.3) 57.6 (+5.8)
 Temporal-aware Video Features (VidSwin-L [47]) 68.0 (-0.1) 32.8 (+4.1) 60.5 (+3.5) 73.0 (+3.3) 55.4 (+3.6)
 Discrete Visual Tokens (DALL-E [56]) 68.4 (+0.3) 28.1 (-0.6) 56.6 (-0.4) 69.4 (-0.5) 51.3 (-0.5)
 Multimodal Features (CLIP-ViT-B [55]) 67.7 (-0.4) 29.8 (+1.1) 57.8 (+0.8) 68.5 (-1.2) 52.1 (+0.3)
Table 1: Comparing target features for MVM applied to video-text data. All variants are pre-trained on WebVid [2] for 5 epochs. Masking is performed randomly (RM) with a ratio of 15%. The final pre-training setting is highlighted in gray.

Optical Flow (Flow).

Optical flow is commonly used in motion analysis and video understanding. Here, we analyze whether the apparent velocity of objects can benefit VidL pre-training. We employ a pre-trained recurrent all-pairs field transforms (RAFT) model [66] to compute optical flow given consecutive video frames. We directly use the estimated optical flow values as the prediction target, and supervise MVM training with the L1 loss. To obtain the MVM predictions, we concatenate the hidden video representations computed by CT on consecutive frames, and employ a linear layer to project the concatenated video representations (of hidden size 768x2) to the same dimension as the estimated optical flow target for a given patch (32x32x2).
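A sketch of the Flow head described above is given below: hidden states of the same patch location in two consecutive frames are concatenated (768x2) and linearly projected to a 2-channel per-pixel flow target; the exact tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class FlowMVMHead(nn.Module):
    """Projects concatenated consecutive-frame patch features to the flow target."""
    def __init__(self, hidden=768, patch=32):
        super().__init__()
        self.proj = nn.Linear(hidden * 2, patch * patch * 2)   # (u, v) flow per pixel

    def forward(self, h_v):                 # h_v: (B, T, N, hidden) per-frame patch features
        pairs = torch.cat([h_v[:, :-1], h_v[:, 1:]], dim=-1)   # consecutive-frame feature pairs
        return self.proj(pairs)             # (B, T-1, N, patch*patch*2)
```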

Discrete Visual Tokens (VQ).

In addition to continuous MVM targets, we also consider the discrete variational autoencoder (dVAE) [69, 56] to quantize video inputs. The dVAE is learned to tokenize images into discrete visual tokens q from a finite dictionary, and then reconstruct the original visual scene based on q, where q has a one-to-one spatial correspondence with the input image patches. We first adopt the dVAE to tokenize each video frame into q, and then a 2-layer MLP is used to project h_v onto the finite VQ vocabulary. As VQ tokens are discrete, we can model MVM with VQ as a classification problem, and adopt the cross-entropy loss to optimize MVM training, following [3].
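The sketch below illustrates MVM with discrete visual tokens as classification; the codebook size of 8192 corresponds to DALL-E's dVAE tokenizer but is an assumption here rather than a value stated in this section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQHead(nn.Module):
    """Projects masked-patch hidden states onto the dVAE codebook vocabulary."""
    def __init__(self, hidden=768, vocab_size=8192):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, vocab_size))

    def forward(self, h_v):
        return self.mlp(h_v)                        # (B, Nv, vocab_size) logits

def vq_mvm_loss(logits, token_ids, mask):           # token_ids: (B, Nv) dVAE codes
    # Cross-entropy against the ground-truth visual token ids of masked patches.
    return F.cross_entropy(logits[mask], token_ids[mask])
```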

Spatial-focused Image Features (SIF).

We investigate whether image features can be useful for improving VidL pre-training. We employ a well-known vision transformer (such as Swin Transformer [46]) to extract the grid features of an input image. We then normalize the extracted grid features and consider them as ground-truth MVM targets. Likewise, we adopt a 2-layer MLP to project h_v to the same dimension as the image feature target.
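One way to extract such SIF targets is sketched below using a pre-trained Swin model from timm; the model name, the features_only call, and the layout handling are assumptions (the paper uses the official Swin-B checkpoint), so this is an illustration rather than the authors' extraction code.

```python
import torch
import timm

@torch.no_grad()
def sif_targets(frames: torch.Tensor):               # (B*T, 3, 224, 224), ImageNet-normalized
    swin = timm.create_model('swin_base_patch4_window7_224', pretrained=True,
                             features_only=True)
    swin.eval()
    feat = swin(frames)[-1]                           # last-stage grid features (7x7 spatial)
    if feat.shape[1] == 7:                            # handle NHWC vs. NCHW layouts across timm versions
        feat = feat.permute(0, 3, 1, 2)
    feat = feat.flatten(2).transpose(1, 2)            # (B*T, 49, C): one target per 32x32 patch
    return torch.nn.functional.normalize(feat, dim=-1)
```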

Temporal-aware Video Features (TVF).

We also study the impact of video features on VidL pre-training. We employ a pre-trained video transformer (such as Video Swin Transformer [47]) to compute temporal-aware features for this analysis. Given a set of video frames, we use the transformer to extract video features in the form of space-time cubes, and then apply regression between the normalized video features and the MVM predictions from a 2-layer MLP head for the masked video patches.

Multimodal Features (MMF).

We further study if the features learned via multimodal pre-training can benefit VidL pre-training. We utilize the vision branch of the ViT-Base backbone [12] in CLIP [55] to extract such multimodal features, and use the normalized features as the prediction targets in MVM pre-training. Again, we apply regression between the MVM predictions made via a 2-layer MLP head and the MMF targets for the masked patches.

4 Study: Target Features for MVM

Settings.

We pre-train VIOLET on WebVid-2.5M [2] for 5 epochs, and report accuracy on TGIF-Frame [25] for video question answering and R1/R5/R10/AveR on DiDeMo [23] for text-to-video retrieval. (We base our ablation experiments on these two representative datasets for fast iteration; our main results are reported on 13 benchmarks in Section 6, and details about downstream adaptation are included in the Appendix.) We initialize our Video Swin Transformer (VT) with VideoSwin-Base [47], pre-trained on Kinetics-600 [27]. The Language Embedder (LE) and Cross-modal Transformer (CT) are initialized from pre-trained BERT-Base [11]. During pre-training, we sparsely sample 4 video frames, randomly crop them to 224x224, and split them into patches with height = width = 32. For all downstream tasks, we adopt the same video frame size and patch size but 5 sparsely sampled frames. We keep the training recipe (e.g., optimizer settings, masking ratio, training schedule, etc.) consistent across all targets, which we find generally good in practice (refer to the Appendix for more training details). For MVM targets that involve a teacher model, we use the official models released by the authors. We compare models pre-trained with the 8 different MVM variants to the baseline pre-trained with only VTM and MLM. Our goal is to find the MVM target features that provide the largest performance improvement over this baseline. Results are summarized in Table 1. We categorize the MVM targets into 4 groups, and discuss their performance in detail.

MVM Targets TGIF-Frame DiDeMo-Retrieval
 Acc. R1 R5 R10 AveR
Pixel 68.3 29.2 58.6 70.1 52.6
Flow 67.6 30.3 58.0 70.3 52.9
SIF 68.8 35.4 62.4 74.9 57.6
SIF + Pixel 68.8 31.8 60.4 73.0 55.1
SIF + Flow 68.7 34.4 61.5 72.8 56.3
Table 2: Combining MVM targets. All variants are pre-trained on WebVid [2] for 5 epochs, using RM with 15% as the masking strategy. The final pre-training setting is highlighted in gray.
Image Features  IN-1K TGIF-Frame DiDeMo-Retrieval
Model Train Data Acc@1 Acc. R1 R5 R10 AveR
ResNet-50 [22] IN-1K 76.1 67.3 29.1 58.1 69.3 52.2
Swin-T [46] IN-1K 81.2 68.9 33.8 63.6 74.2 57.2
Swin-B IN-1K 83.5 68.3 34.9 63.4 73.9 57.4
Swin-B IN-22K 85.2 68.8 35.4 62.4 74.9 57.6
Swin-L IN-22K 86.3 68.2 33.2 62.4 72.6 56.1
Table 3: Comparing different image feature targets for MVM. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (SIF) for 5 epochs, using RM with 15% as the masking strategy. The final pre-training setting is highlighted in gray.

One-stage Visual Targets.

We include Pixel and HOG, as they do not require training a deep neural network in advance to extract these features. Compared to the baseline without the MVM objective, regressing the explicit RGB colors contributes a relatively small gain of +0.2% on TGIF-Frame and +0.8% on AveR for DiDeMo-Retrieval. In contrast, HOG leads to degradation in downstream video-language (VidL) performance (-0.8% on TGIF-Frame and -2.0% on DiDeMo-Retrieval). We hypothesize that this is due to the missing color information in HOG features, which is critical for VidL understanding.

Supervised Pseudo-label Targets.

We include Depth Maps (Depth) and Optical Flow (Flow). Intuitively, Depth and Flow can be considered as continuous pseudo "labels", produced by models trained to perform depth and optical flow estimation [57, 66]. Depth does not improve over the VTM+MLM baseline. The nature of depth maps is to separate the foreground from the background, which may guide the model to ignore information from the background, even when it is relevant for solving downstream VidL tasks (-0.1% on TGIF-Frame, -1.6% on DiDeMo-Retrieval). Flow only focuses on the moving parts between frames, while ignoring the spatial details of static components, and thus fails on the more spatially focused TGIF-Frame task (-0.5%). We also find that the optical flow estimation model easily fails with the sparse sampling strategy widely adopted in VidL pre-training (please find visualization examples in the Appendix).

Supervised Visual Feature Targets.

We include continuous features extracted from the last layers of an image classification model [46] (i.e., Spatial-focused Image Features (SIF)) and an action recognition model [47] (i.e., Temporal-aware Video Features (TVF)). We consider regressing supervised features from Swin-B or VidSwin-L as a type of knowledge distillation from unimodal models to VIOLET. SIF achieves a significant improvement over the baseline (+0.7% on TGIF-Frame and +5.8% on AveR for DiDeMo-Retrieval). In contrast, TVF fails to improve TGIF-Frame accuracy (-0.1%), though it brings a notable improvement in retrieval performance (+3.6% on AveR). By distilling the knowledge from Swin-B, we enforce the model to focus more on the spatial details of each frame, which we hypothesize is the main reason behind the large performance improvement. As a previous study [4] pointed out, existing VidL benchmarks largely test spatial understanding of the key frame of the video, with only a fraction of examples actually testing temporal reasoning over multiple frames.

Self-supervised Multimodal Feature Targets.

We use Discrete Visual Tokens (VQ) from DALL-E [56] and continuous Multimodal Features (MMF) extracted from CLIP [55]. Both models are pre-trained on large-scale image-text datasets, usually much more expensive to obtain than all other targets. Each of the two targets improves the performance by a slight margin on only one task. VQ, which can capture patch-level semantics, benefits TGIF-Frame (+0.3%), which mostly focuses on scene understanding, while MMF from CLIP, contrastively pre-trained to measure the high-level similarity between an entire image and a text sentence, is helpful for DiDeMo-Retrieval (+0.3% on AveR).

Summary.

Among all targets, regressing RGB values (Pixel) and distilling features from Swin-B [46] (SIF) are the only two that produce consistent gains over the baseline on both downstream tasks. MVM with SIF achieves the best performance, with a gain of +0.7% on TGIF-Frame and +5.8% on AveR for DiDeMo-Retrieval over the baseline. Therefore, we use SIF as the default target for MVM in the following sections, unless specified otherwise.

Masking Time Cost TGIF-Frame DiDeMo-Retrieval
Strategy (hours) Acc. R1 R5 R10 AveR
RM 8.0 68.8 35.4 62.4 74.9 57.6
BM 8.0 69.0 35.9 63.3 74.6 57.9
AM 34.5 68.4 31.5 59.9 72.0 54.7
RM+BM 8.0 68.7 36.4 64.2 74.4 58.3
RM+AM 20.5 68.8 33.7 63.2 73.5 56.8
BM+AM 20.5 68.9 35.6 61.9 74.4 57.3
RM+BM+AM 17.0 68.6 34.7 62.0 74.8 57.2
Table 4: Impact of masking strategy of MVM. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (SIF) for 5 epochs. The masking ratio is set as 15% for all masking strategies. The final pre-training setting is highlighted in gray.
Masking Ratio TGIF-Frame DiDeMo-Retrieval
 Acc. R1 R5 R10 AveR
15% 68.8 35.4 62.4 74.9 57.6
30% 68.8 36.2 64.0 74.5 58.2
45% 68.9 35.6 61.9 74.4 57.3
60% 68.1 34.1 63.9 74.6 57.5
75% 68.3 35.4 62.4 74.2 57.3
Table 5: Impact of masking ratio of MVM. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (SIF) for 5 epochs, using RM as the masking strategy. The final pre-training setting is highlighted in gray.

5 Analyses of MVM

Combining MVM Targets.

As different MVM targets focus on different aspects of visual modeling, a naive way to endow the model with different visual capabilities is to combine them. Specifically, the model pre-training can be supervised by more than one MVM loss, which are simply summed before backpropagation. In Table 2, we find there is no merit in combining different MVM targets: all combinations lead to worse downstream performance than using SIF alone. When combining the best two targets found in Table 1 (Pixel+SIF), the combination performs better than Pixel only, but does not improve over using SIF alone. We hypothesize that the explicit details of pixel values may conflict with the high-level visual semantics summarized in the grid features from the image classifier. We further combine SIF with Flow in the hope of enforcing both temporal and spatial reasoning over video inputs; Flow is a better candidate than the other targets, as it demonstrates some advantages on retrieval performance in Table 1, and, unlike temporal-aware video features, it is a different type of target from SIF. The results are consistent: SIF+Flow improves over Flow only, while the performance drops compared to SIF alone. Though our results are not encouraging, we believe how to effectively combine different MVM targets is an interesting direction for future study.

MVM Target Extractors vs. Downstream Performance.

In Table 3, we explore different image classification models as the MVM target extractor for SIF, and verify whether a stronger image classification model enables better VidL performance. We compare ResNet-50 [22] and Swin-Tiny/Base/Large [46], trained on ImageNet-1K (IN-1K) or ImageNet-22K (IN-22K) [10]. Our results suggest that downstream VidL performance is not directly proportional to image classification accuracy. However, distilling grid features from the Swin architecture is evidently more effective than from ResNet-50, as Swin models share a similar inductive bias with the VideoSwin backbone in VIOLET.

Masking Strategy.

We investigate the effect of different masking strategies in Table 4, including random masking (RM), blockwise masking (BM), attended masking (AM), and their combinations. Below, we introduce each masking strategy in detail.


  • Random Masking (RM). Following the conventional practice in MLM, we randomly select a certain percentage of video frame patches from the whole video input to be masked. In Table 5, we explore different masking ratios (from 15% to 75%), and empirically find that 30% gives the best downstream performance.

  • Blockwise Masking (BM). To make MVM rely less on similar neighboring patches, we adopt blockwise masking [65, 3], which masks blocks of video patches along the spatial-temporal dimension rather than independently masking randomly sampled patches in each frame. Specifically, we randomly sample a spatial block as the masking block, and all visual patches at the same locations in the following consecutive frames are also masked; we repeat this process until the target ratio of video patches is masked for MVM pre-training.

  • Attended Masking (AM). Attended masking tries to put more weight on the more important elements, based on the attention weights computed by the Cross-modal Transformer (CT). A similar idea has been explored in [84] for MLM. Here, we extend AM to both the visual and textual modalities. We first keep the video-text inputs intact and feed them into CT to compute the attention weights, which decide which portions of the video and text are more important. We then select the most-attended patches/tokens (up to the masking ratio) to be masked in the video-text inputs for MVM and MLM.

To combine different masking strategies, we randomly apply one masking method to each video-text pair in a batch. Results in Table 4 suggest that TGIF-Frame can slightly benefit from BM, and combining BM with RM leads to the best retrieval performance on DiDeMo. As videos usually present similar visual patterns in spatial-temporal neighbors (i.e., nearby patches within the current frame or neighboring frames), when masking patches independently (i.e., RM), these neighbors can make the masked patches easy to recover, which may lead to spurious success in MVM evaluation. By masking a block (i.e., BM) instead of individual patches, the model cannot merely rely on similar neighboring visual cues, but requires actual visual reasoning to recover a group of missing patterns. Combining BM with RM leads to more diverse dropout patterns in the video inputs, analogous to data augmentation. A minimal sketch of RM and BM is given below.
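The sketch below implements the two strategies that end up in the final recipe; the maximum block size is an assumption, and the blockwise variant simply keeps sampling blocks until the target ratio is reached.

```python
import torch

def random_mask(T, H, W, ratio=0.30):
    """Random masking (RM): mask a fixed ratio of space-time patches uniformly."""
    mask = torch.zeros(T * H * W, dtype=torch.bool)
    idx = torch.randperm(T * H * W)[: int(ratio * T * H * W)]
    mask[idx] = True
    return mask.view(T, H, W)

def blockwise_mask(T, H, W, ratio=0.30, max_block=3):
    """Blockwise masking (BM): mask spatial blocks spanning consecutive frames."""
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    target = int(ratio * T * H * W)
    while mask.sum() < target:                              # sample blocks until the ratio is reached
        t0 = torch.randint(0, T, (1,)).item()
        bt = torch.randint(1, min(max_block, T - t0) + 1, (1,)).item()
        bh = torch.randint(1, max_block + 1, (1,)).item()
        bw = torch.randint(1, max_block + 1, (1,)).item()
        h0 = torch.randint(0, H - bh + 1, (1,)).item()
        w0 = torch.randint(0, W - bw + 1, (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

# Example: 4 frames, 7x7 patches per frame (224x224 input with 32x32 patches).
rm = random_mask(4, 7, 7)
bm = blockwise_mask(4, 7, 7)
```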

Pre-training Tasks MVM Target TGIF-Frame DiDeMo-Retrieval
Acc. R1 R5 R10 AveR
ITM+MLM None 69.8 36.4 64.3 74.7 58.4
+MVM RGB Pixel Values 69.7 (-0.1) 35.8 (-0.6) 64.4 (+0.1) 74.9 (+0.2) 58.4
Histogram of Oriented Gradients [9] 69.8 34.9 (-1.5) 64.4 (+0.1) 75.1 (+0.4) 58.1 (-0.3)
Depth Maps (DPT-L [57]) 69.6 (-0.2) 32.3 (-4.1) 63.8 (-0.5) 74.2 (-0.5) 56.9 (-1.5)
Spatial-focused Image Features (Swin-B [46]) 69.7 (-0.1) 31.6 (-4.8) 60.5 (-3.8) 72.5 (-2.2) 54.9 (-3.5)
Discrete Visual Tokens (DALL-E [56]) 69.8 34.4 (-2.0) 62.6 (-1.7) 75.1 (+0.4) 57.4 (-1.0)
Multimodal Features (CLIP-ViT-B [55]) 69.8 33.6 (-2.8) 62.9 (-1.4) 75.6 (+0.9) 57.4 (-1.0)
Table 6: Comparing target features for MVM applied to image-text data. All variants are pre-trained on CC3M [62] for 5 epochs. Masking is performed randomly (RM) with ratio of 15%.
Pre-training Tasks MVM Target TGIF-Frame DiDeMo-Retrieval
WebVid2.5M CC3M Acc. R1 R5 R10 AveR
VTM+MLM None None 69.7 36.7 66.5 76.6 59.9
+MVM Spatial-focused Image Features (Swin-B [46]) None 71.1 38.8 69.6 80.0 62.8
Spatial-focused Image Features (Swin-B) Pixel 71.3 39.7 69.3 78.4 62.5
Table 7: Combining MVM target features for both video-text and image-text data. All variants are pre-trained on WebVid2.5M [2] +CC3M [62] for 5 epochs. The final pre-training setting is highlighted in gray.

In addition, AM and its combinations with other strategies are not effective for either downstream task. It is also worth noting that AM greatly increases the training time (about 4 times more than RM/BM), due to the additional forward pass needed to compute the attention weights. In our implementation, we optimize the three losses together in the same forward-backward pass. Hence, the performance drop with AM may be because the important elements (e.g., visual patches containing the main object, or content words) are more likely to be masked together, leaving the less relevant elements (e.g., scene background or stop words) intact, which especially makes the learning of video-text matching harder.

Applying MVM to Image-Text Data.

As an image can be considered a special case of video with temporal size 1, video-language (VidL) pre-training can take advantage of image-text data, which has been proven successful in [34, 2]. The current trend in VidL pre-training is to leverage both video-text data and image-text data. Therefore, we repeat the experiments in Section 4 and examine which MVM targets work best on downstream VidL tasks when pre-training on image-text data only. We remove optical flow and temporal-aware video features from this study, as the inputs are static images. In Table 6, we pre-train VIOLET on CC3M [62] for 5 epochs and report results on TGIF-Frame and DiDeMo-Retrieval. The performance trend across MVM targets is not consistent with that observed on video-text data. Pixel is able to largely preserve the baseline (VTM+MLM) performance, while the other MVM targets lead to different degrees of performance drop, especially on retrieval. Without visual cues from neighboring frames as in video, MVM is more challenging to learn on image data. On the other hand, MVM over an image may easily overfit to static visual representations, which could hurt video temporal reasoning and does not benefit downstream VidL learning.

Combining Video-Text Data with Image-Text Data.

We further follow [2, 38] to use both video-text data and image-text data for pre-training, and investigate different ways to combine MVM targets on image and video data in Table 7. Note that we adopt the best training strategy found in the above investigations, that is, using spatial-focused image features (SIF) as the MVM target for video inputs, and using blockwise masking (BM) + random masking (RM) with a masking ratio of 30% as the masking strategy. As the best MVM target on image data (Pixel) does not show improvement over the baseline without the MVM objective in Table 6, we explore pre-training with and without the MVM objective on images in this combined setting. For the baseline with VTM+MLM only, we simply remove the MVM objective on both image and video data, while keeping the rest of the training settings. Under this strictly fair comparison, we observe that adding MVM objectives contributes +0.4% gains on TGIF-Frame and a +2.6% increase on AveR for DiDeMo-Retrieval. Comparing pre-training with and without the MVM objective on images, the two variants achieve comparable performance on both tasks. Therefore, in our final setting, we only apply the MVM objective on video data.

# Pretrain TGIF [25] MSRVTT[80] LSMDC [68] MSVD [6] Captioning
Method videos/images Act. Trans. Frame MC [83] QA [79] MC FiB QA [79] MSRVTT MSVD
ClipBERT [34] 0.2M 82.8 87.8 60.3 88.2 37.4 - - - - -
ALPRO [38] 5M - - - - 42.1 - - 46.3 - -
SwinBERT [42] - - - - - - - - - 53.8 120.6
Models pre-trained on more data
JustAsk [81] 69M - - - - 41.5 - - 46.3 - -
MERLOT [84] 180M 94.0 96.2 69.5 90.9 43.1 81.7 52.9 - - -
All-in-one [71] 283M 95.5 94.7 66.3 92.3 46.8 84.4 - 48.3 - -
MV-GPT [61] 53M - - - - 41.7 - - - 60.0 -
VIOLET 5M 94.8 99.0 72.8 97.6 44.5 84.4 56.9 54.7 58.0 139.2
Table 8: Comparison with SOTA on video question answering and video captioning. Accuracy and CIDEr scores are reported for QA and captioning. VIOLET is pre-trained on WebVid2.5M [2]+CC3M [62] with VTM+MLM+MVM (SIF on videos) for 10 epochs. We gray out methods that use significantly more pre-training data for fair comparison.
# Pretrain MSRVTT [80] DiDeMo [23] LSMDC [59]
Method videos/images R1 R5 R10 R1 R5 R10 R1 R5 R10
ClipBERT [34] 0.2M 22.0 46.8 59.9 20.4 48.0 60.8 - - -
Frozen [2] 5M 31.0 59.5 70.5 31.0 59.8 72.4 15.0 30.8 39.8
ALPRO [38] 5M 33.9 60.7 73.2 35.9 67.5 78.8 - - -
BridgeFormer [20] 5M 37.6 64.8 75.1 37.0 62.2 73.9 17.9 35.4 44.5
Models pre-trained on more data
HERO [39] 136M 16.8 43.4 57.7 - - - - - -
All-in-one [71] 138M 37.9 68.1 77.1 32.7 61.4 73.5 - - -
Clip4Clip [49] 400M 42.1 71.9 81.4 43.4 70.2 80.6 21.6 41.8 49.8
VIOLET 5M 37.2 64.8 75.8 47.9 76.5 84.1 24.0 43.5 54.1
Table 9: Comparison with SOTA on text-to-video retrieval tasks. All results are reported on R1/5/10. We gray out methods that use significantly more pre-training data for fair comparison.

6 Main Results

In this section, we present the main results of VIOLET on 13 video-language (VidL) tasks, and compare them with prior arts. Table 8 shows the comparison on video question answering (QA) and video captioning. We observe that VIOLET is effective in learning transferable knowledge for the downstream tasks. For example, considering pre-training data at a similar scale (i.e., 5M), as shown in the top rows of Table 8, VIOLET achieves better results than prior arts, including ALPRO [38], ClipBERT [34], and SwinBERT [42], across all considered video QA and video captioning benchmarks. Specifically, when pre-training with the exact same data (i.e., WebVid2.5M [2] + CC3M [62]), VIOLET surpasses ALPRO by 2.4% accuracy on MSRVTT-QA and 8.4% accuracy on MSVD-QA.

We also compare with other models pre-trained on significantly larger scale of video-text pairs. As shown in the bottom rows of Table 8, although we use less pre-training data than others, VIOLET still achieves comparable or better performance than those large-scale pre-training models.

We observe similar findings on video captioning. On MSRVTT captioning, VIOLET is only 2 points behind MV-GPT [61], which is pre-trained with 53M video-text pairs, more than 10 times larger than ours (5M). In addition, MV-GPT leverages ASR transcripts to enhance captioning performance, while our captioning model takes only video frames as inputs and outputs the video caption (details about downstream finetuning on captioning benchmarks are included in the Appendix). We believe augmenting VIOLET with additional modalities, such as audio or ASR transcripts, can further improve captioning performance, which we leave as future work.

Table 9 presents the comparison on text-to-video retrieval. When pre-training with the same datasets (i.e., WebVid2.5M [2] + CC3M [62]), VIOLET shows across-the-board improvements on all metrics considered on DiDeMo and LSMDC. It is worth noting that our method performs comparably to BridgeFormer [20] on MSRVTT-Retrieval. BridgeFormer adopts a noun/verb masking strategy during pre-training, which is specially aligned with the simple sentences in MSRVTT. However, it does not show similar effects on DiDeMo and LSMDC, whose texts are more complex with multiple nouns/verbs. In contrast, the studied MVM achieves a comprehensive enhancement in VidL learning and leads to notable improvements (+10.9% R1 on DiDeMo and +6.1% R1 on LSMDC).

7 Conclusion

We initiate the first empirical study on adopting masked visual modeling (MVM) for video-language (VidL) learning. We explore diverse MVM objectives upon the end-to-end VIdeO-LanguagE Transformer (VIOLET), ranging from low-level pixel values and high-level visual semantics to extracted latent features. We show that image-text and video-text data may not share the same best MVM target. Specifically, spatial-focused image features (SIF) work best on video-text inputs, while for static images paired with text, only RGB pixel values preserve the baseline performance of pre-training without MVM. Our analyses of different combinations of MVM targets, various SIF target extractors, and varying masking strategies/ratios shed light on effective MVM design. VIOLET pre-trained with the MVM objective achieves strong performance on 3 popular VidL tasks, including video question answering, video captioning and text-to-video retrieval, across 13 VidL benchmarks.

Appendix A Additional Results

Weight Initialization of Video Backbone.

We compare masked visual modeling (MVM) with spatial-focused image features (SIF) under different initializations of the video backbone in Table 10. First, even when the video transformer (VT) is randomly initialized, MVM training still enhances the visual representation and benefits the downstream video-language (VidL) tasks. Furthermore, MVM also boosts the better-initialized VT from VidSwin-B, leading to gains across the board. Notably, the improvement gap is larger than with random initialization: with a better-initialized VT, the model learns more from MVM, amplifying its effectiveness during pre-training.

Weight Init. MVM TGIF-Frame DiDeMo-Retrieval
  Acc. R1 R5 R10 AveR
Random ✗ 55.9 5.6 19.9 29.8 18.5
Random ✓ 56.5 7.4 22.9 33.8 21.4
VidSwin-B [47] ✗ 68.1 28.7 57.0 69.7 51.8
VidSwin-B [47] ✓ 68.8 35.1 63.3 73.1 57.2
Table 10: Impact of weight initialization of the video backbone. All variants are pre-trained on WebVid [2] for 5 epochs. The MVM target is spatial-focused image features (SIF) from Swin-B [46]. The final pre-training setting is highlighted in gray.

Type of MVM Loss.

We compare the type of loss function for MVM training, using least absolute deviations (L1) or least square errors (L2), in Table 11. It is well known that the L1 loss is more resistant to outliers. We show that MVM trained with the L1 loss is also more robust and leads to better performance on both video question answering and text-to-video retrieval than the L2 loss.

MVM Loss TGIF-Frame DiDeMo-Retrieval
Acc. R1 R5 R10 AveR
L1 68.8 35.4 62.4 74.9 57.6
L2 68.8 33.0 60.1 71.9 55.0
Table 11: Impact of MVM loss type. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (SIF) for 5 epochs, using RM as the masking strategy with ratio of 15%. The final pre-training setting is highlighted in gray.

MVM Prediction Head.

We investigate the prediction head for MVM in Table 12. We find that a single linear layer is not sufficient to model the distilled MVM features. Therefore, following VTM and MLM, we use a 2-layer MLP as the prediction head for MVM.

MVM Head TGIF-Frame DiDeMo-Retrieval
Acc. R1 R5 R10 AveR
1 Linear Layer 68.8 31.3 60.1 72.8 54.7
2-layer MLP 68.8 35.4 62.4 74.9 57.6
Table 12: Impact of MVM prediction head. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (SIF) for 5 epochs, using RM as the masking strategy with ratio of 15%. The final pre-training setting is highlighted in gray.

TVF Target Extractors vs. Downstream Performance.

We compare distilling video features from VidSwin-B vs. VidSwin-L (the default setting in the main text) in Table 13. Here, for the experiments with VidSwin-B, the same VidSwin-B weights are used both to initialize the video backbone and to extract the MVM target. Hence, the MVM objective can be easily minimized by simply ignoring the text inputs, which conflicts with the other objectives. This variant is in principle similar to masked frame modeling in HERO [39]; the key difference lies in whether the video backbone in VIOLET is refined during pre-training.

MVM Target TGIF-Frame DiDeMo-Retrieval
Acc. R1 R5 R10 AveR
TVF (VidSwin-L [47]) 68.0 32.8 60.5 73.0 55.4
TVF (VidSwin-B) 67.5 25.8 55.0 68.0 49.6
Table 13: Temporal-aware video feature (TVF) target models vs. downstream performance. All variants are pre-trained on WebVid [2] with VTM+MLM+MVM (TVF) for 5 epochs, using RM as the masking strategy with ratio of 15%.

Additional Exploration in Combining MVM Targets.

We explore additional combinations of distilled MVM targets in Table 14. MVM with SIF has a clear advantage over TVF alone on both video question answering and text-to-video retrieval, while SIF+TVF does not bring a robust improvement and even decreases text-to-video retrieval performance. A previous study [4] shows that current VidL benchmarks primarily focus on spatial understanding of the key frame of a video. Furthermore, combining TVF with SIF results in excessive training overhead. Accordingly, we choose SIF as our final pre-training setting.

MVM Targets TGIF-Frame DiDeMo-Retrieval
Acc. R1 R5 R10 AveR
SIF 68.8 35.4 62.4 74.9 57.6
TVF 68.0 32.8 60.5 73.0 55.4
SIF + TVF 69.2 33.8 63.0 74.4 57.1
Table 14: Combining target features for MVM. All variants are pre-trained on WebVid [2] for 5 epochs. The final pre-training setting is highlighted in gray.
Figure 2: Visualization of optical flow (Flow) predictions by RAFT-L [66] with sparsely sampled frames. We show examples of good cases in (a) and bad cases in (b).

Qualitative Results.

Figure 2 shows good and bad examples of optical flow predictions made by RAFT-L [66] with sparsely sampled frames. In Figure 2(b), the top example shows zoom-in shots, and the bottom one shows moving shots where all content in the current frame moves, which is the main reason behind the failure in optical flow estimation.

We also show visualizations of zero-shot text-to-video retrieval on MSRVTT (Figure 3), DiDeMo (Figure 4), and LSMDC (Figure 5) to demonstrate that MVM can help video understanding across different domains, such as gaming, animation, human activity, and movie scenes.

Appendix B Additional Pre-training Details

Video-Text Matching (VTM).

VTM enhances cross-modal fusion by modeling the alignment between visual and textual inputs. At each training step, we randomly replace the corresponding text for a given video with the text description from a different video in the same batch. Both the positive pair and the negative pair are modeled by the Cross-modal Transformer (CT), and VTM learns to tell them apart from the global VidL representation at the [CLS] token. In particular, h_[CLS] is processed by a fully-connected layer (FC^VTM) to learn contrastively through classification:

$\mathcal{L}_{\mathrm{VTM}} = -\mathbb{E}\big[\log b^{\mathrm{pos}} + \log(1 - b^{\mathrm{neg}})\big], \quad b = \sigma\big(\mathrm{FC}^{\mathrm{VTM}}(h_{\mathrm{[CLS]}})\big),$    (2)

where $b^{\mathrm{pos}}$ or $b^{\mathrm{neg}}$ is the matching score of positive or negative pairs.
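A minimal sketch of this head is given below; the single-logit binary-classification form is an assumption consistent with the description above, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VTMHead(nn.Module):
    """Binary matching classifier on the [CLS] representation."""
    def __init__(self, hidden=768):
        super().__init__()
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h_cls):                      # (B, hidden) [CLS] outputs
        return self.fc(h_cls).squeeze(-1)          # matching logits

def vtm_loss(pos_logits, neg_logits):
    # Positive (matched) pairs labeled 1, in-batch shuffled negatives labeled 0.
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```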

Masked Language Modeling (MLM).

In MLM, we randomly mask out word tokens with a probability of 15%. Following BERT [11], we replace 80% of the masked word tokens with the [MASK] token, 10% with a random token, and 10% with the original token. The goal is to recover these masked word tokens from the joint VidL features modeled by CT. Specifically, the corresponding representations h_w of these masked tokens are fed into a fully-connected layer (FC^MLM) and projected to the discrete token space for classification:

$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}\Big[\sum_{i \in \mathcal{M}} \log P\big(t_i \mid \mathrm{FC}^{\mathrm{MLM}}(h_{w_i})\big)\Big],$    (3)

where $\mathcal{M}$ denotes the index set of masked word tokens.
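The corresponding sketch is below; the BERT-Base vocabulary size is assumed, and the head is shown as a single linear projection for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMHead(nn.Module):
    """Projects word-token hidden states to the vocabulary for masked-word prediction."""
    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, h_w):                        # (B, Nt, hidden) word-token outputs
        return self.fc(h_w)

def mlm_loss(logits, target_ids, masked_positions):    # masked_positions: (B, Nt) bool
    # Cross-entropy only over the masked word positions.
    return F.cross_entropy(logits[masked_positions], target_ids[masked_positions])
```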

Implementation Details

Our implementation of VIOLET is based on PyTorch [53]. As discussed in the main text and supported by the additional experimental results above, our final pre-training setting is: (1) VTM+MLM+MVM (with the MVM target being spatial-focused image features from Swin-B [46], applied on video-text inputs only) as the pre-training tasks; (2) a 2-layer MLP as the MVM prediction head and L1 regression as the MVM loss; and (3) blockwise masking + random masking with a masking ratio of 30% as the masking strategy. We adopt AdamW [48] as the optimizer with a warmup learning rate schedule, a peak learning rate of 5e-5, betas of (0.9, 0.98), and weight decay of 1e-3 for all pre-training experiments. We pre-train VIOLET on 32 NVIDIA V100 GPUs with a batch size of 28 per GPU. Pre-training for 10 epochs on WebVid2.5M [2] + CC3M [62] takes about 27 hours to finish. We present the training settings for all finetuning experiments in the next section.

Appendix C Experimental Setup of Downstream Tasks

We evaluate our pre-trained VIOLET on 3 popular video-language tasks, including text-to-video retrieval, video question answering and video captioning, across 13 downstream datasets. For text-to-video retrieval, we report model performance on MSRVTT [80], DiDeMo [23], and LSMDC [59], using Recall at K (R@K, K=1,5,10) as the evaluation metric. For video question answering, we consider datasets in both multiple-choice and open-ended settings, including TGIF-Action, TGIF-Transition, TGIF-Frame [25], MSRVTT-MC [83], MSRVTT-QA, MSVD-QA [79], LSMDC-MC and LSMDC-FiB [68]. We evaluate these models using accuracy. For video captioning, we report CIDEr scores on MSRVTT and MSVD.

We follow the standard training/validation/testing splits of the original datasets. Unless otherwise stated, we sparsely sample 5 video frames and adopt a video frame size of 224 with patch size 32x32. Similar to pre-training, we use AdamW [48] to fine-tune VIOLET for each downstream task with a warmup learning rate schedule, a peak learning rate of 2e-5, betas of (0.9, 0.98), and weight decay of 1e-3. All finetuning experiments are conducted on Microsoft Azure [50] with mixed-precision training via DeepSpeed [58]. (We conduct retrieval finetuning on 8 x 80GB A100 GPUs to enable a larger batch size, while all other finetuning experiments are conducted on 8 x 32GB V100 GPUs.) All video data are pre-processed by evenly extracting 32 frames to avoid expensive decoding on-the-fly. During training, we randomly sample frames from the 32 frames, resize the shorter side of all frames to 224, and random crop (224x224) at the same location for all frames in a given video. During inference, we evenly sample frames from the 32 frames and center crop (224x224) all sampled video frames.

C.1 Text-To-Video Retrieval

For text-to-video retrieval, similar to video-text matching (VTM) during pre-training, we treat corresponding video-text pairs in the same batch as positives and all other pairwise combinations as negatives. We adopt a fully-connected (FC) layer (FC^ret) over the VidL representation of the [CLS] token to learn through classification:

$\mathcal{L}_{\mathrm{ret}} = -\mathbb{E}\big[\log b^{\mathrm{pos}} + \log(1 - b^{\mathrm{neg}})\big], \quad b = \sigma\big(\mathrm{FC}^{\mathrm{ret}}(h_{\mathrm{[CLS]}})\big),$    (4)

where $b^{\mathrm{pos}}$ or $b^{\mathrm{neg}}$ is the score of positive or negative pairs. In particular, we use the pre-trained FC^VTM for zero-shot text-to-video retrieval and to initialize FC^ret for further fine-tuning on each downstream text-to-video retrieval task.

MSRVTT [80]

contains 10K YouTube videos with 200K human annotations. For fair comparison [2, 34], we train on the 9K training+validation split and evaluate on the 1K-A testing split. We adopt a batch size of 20 per GPU and train for 10 epochs.

DiDeMo [23]

consists of 10K videos annotated with 40K sentences from Flickr. Following [2, 34], we concatenate all sentences of the same video into a paragraph and perform paragraph-to-video retrieval on DiDeMo. We adopt a batch size of 16 per GPU and train for 10 epochs.

LSMDC [59]

contains 118K video clips from 202 movies. Each clip has a caption from movie scripts or descriptive video services. Following [2, 52], we evaluate on 1K testing clips that are disjoint from the training+validation splits. We adopt a batch size of 20 per GPU and train for 5 epochs.

VideoQA Task #Option
Multiple-Choice TGIF-Action [25] 5
TGIF-Transition [25] 5
MSRVTT-MC [83] 5
LSMDC-MC [68] 5
Open-Ended TGIF-Frame [25] -
MSRVTT-QA [79] -
MSVD-QA [7] -
LSMDC-FiB [68] -
Table 15: Summary of video question answering tasks. For open-ended Video QA, we do not limit the answer vocabulary to a fixed answer candidate set.

C.2 Video Question Answering

We test our model on video question answering (QA) tasks in both multiple-choice and open-ended settings, as summarized in Table 15. We follow LAVENDER [40] to formulate video QA as masked language modeling due to its superior performance. For multiple-choice QA tasks, we concatenate the question with all answer options and append a [MASK] to form the input text (Q+A0+A1+A2+A3+A4+[MASK]). We adopt the same masked language modeling (MLM) head used during pre-training on top of the [MASK] representation to predict the word token corresponding to the answer index (e.g., 0, 1, 2, 3, 4). Similarly, for open-ended QA tasks, we apply MLM over the input (Q+[MASK]). Cross-entropy loss over the whole word vocabulary is used to supervise downstream finetuning.
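The input construction for the multiple-choice case might look as sketched below; the Hugging Face BertTokenizer is used only as an illustrative stand-in, and the exact separator formatting beyond the Q+A0+...+[MASK] template described above is an assumption.

```python
from transformers import BertTokenizer

def build_mc_input(question, options):
    """Multiple-choice QA as MLM: question + all options + an appended [MASK]."""
    return question + " " + " ".join(options) + " [MASK]"

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = build_mc_input("what is the man doing?",
                      ["running", "cooking", "singing", "driving", "reading"])
ids = tokenizer(text, return_tensors="pt").input_ids
# The MLM head's prediction at the [MASK] position is compared against the word
# tokens "0"-"4" to select the answer index.
```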

TGIF-Action, TGIF-Transition, and TGIF-Frame [25]

require spatial-temporal reasoning to answer questions regarding GIF videos in TGIF-QA. Specifically, we aim to test our model along three dimensions: (1) Action: to recognize the repeated action; (2) Transition: to identify the transition between the before and after states; and (3) Frame: to answer questions about a specific frame of the GIF video. Among them, TGIF-Action and TGIF-Transition are collected under the multiple-choice setting, while TGIF-Frame is an open-ended video QA task with free-form answers. We adopt a batch size of 24 and train for 56/20/10 epochs for Action/Transition/Frame, respectively.

MSRVTT-MC [83] and MSRVTT-QA [79]

are created based on videos and captions in MSRVTT [80]. MSRVTT-MC is a multiple-choice task with videos as questions and captions as answers. Each video has 5 candidate captions, with only one positive match. This setting can be viewed as video-to-text retrieval; hence we simply evaluate the model trained on MSRVTT-Retrieval. MSRVTT-QA contains 243K open-ended questions over 10K videos. We adopt a batch size of 24 per GPU and train for 8 epochs.

MSVD-QA [79]

consists of 47K open-ended questions over 2K videos, based on video-caption pairs from MSVD [7]. We adopt batch size 24 per GPU and train for 10 epochs.

LSMDC-MC and LSMDC-FiB [68]

are built from the LSMDC dataset [59]. Similar to MSRVTT-MC, LSMDC-MC requires the model to select the only positive caption describing the video out of 5 candidates, and can be formulated as video-to-text retrieval. LSMDC-FiB replaces a word in the question sentence with a [BLANK] token, and requires the model to recover the missing word. We regard LSMDC-FiB as an open-ended video QA task. In particular, we replace the [BLANK] token with the [MASK] token, and use the MLM prediction head over the representation of the [MASK] token to predict the correct answer. We adopt a batch size of 24 per GPU and train for 10 epochs.

C.3 Video Captioning

For video captioning, we evaluate on MSRVTT [80] and MSVD [6]. MSRVTT consists of 10K videos with 20 captions per video, and MSVD contains 2K videos, with 40 captions per video. We follow the standard captioning splits to train/evaluate with VIOLET.

Captioning finetuning is formulated as masked language modeling (MLM) with a causal attention mask, so that the current word token only attends to the tokens before it, but not the ones after it, following SwinBERT [42]. During training, we set the probability of randomly masking caption tokens to 0.15, the same as in MLM during pre-training. We adopt a batch size of 24 per GPU and train for 20 epochs. During inference, we generate the captions auto-regressively. At each generation step, a [MASK] token is appended to the previously generated tokens, and the model predicts the current token based on the learned embedding at the [MASK] position. Generation continues until the model outputs [SEP] (defined as the sentence-ending token) or reaches the maximum of 50 generation steps.
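The decoding loop described above could look roughly like the sketch below; the model interface (model(video, token_ids) returning per-position vocabulary logits) and the BERT-like tokenizer attributes are hypothetical, used only to illustrate the append-[MASK]-and-predict procedure.

```python
import torch

@torch.no_grad()
def generate_caption(model, video, tokenizer, max_steps=50):
    """Greedy auto-regressive captioning by repeatedly appending a [MASK] token."""
    tokens = [tokenizer.cls_token_id]
    for _ in range(max_steps):
        inp = torch.tensor([tokens + [tokenizer.mask_token_id]])
        logits = model(video, inp)                  # hypothetical: (1, len, vocab) logits
        next_id = logits[0, -1].argmax().item()     # prediction at the [MASK] position
        if next_id == tokenizer.sep_token_id:       # stop at the sentence-ending token
            break
        tokens.append(next_id)
    return tokenizer.decode(tokens[1:])
```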

Figure 3: Qualitative examples of zero-shot text-to-video retrieval on MSRVTT [80].
Figure 4: Qualitative examples of zero-shot text-to-video retrieval on DiDeMo [23].
Figure 5: Qualitative examples of zero-shot text-to-video retrieval on LSMDC [59].

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018)

    Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

    .
    In

    Conference on Computer Vision and Pattern Recognition (CVPR)

    ,
    Cited by: §2.
  • [2] M. Bain, A. Nagrani, G. Varol, and A. Zisserman (2021) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In International Conference on Computer Vision (ICCV), Cited by: Table 10, Table 11, Table 12, Table 13, Table 14, Appendix B, §C.1, §C.1, §C.1, §2, Table 1, §4, Table 3, Table 5, §5, §5, Table 7, Table 8, Table 9, §6, §6.
  • [3] H. Bao, L. Dong, and F. Wei (2022) BEiT: BERT Pre-Training of Image Transformers. In International Conference for Learning Representations (ICLR), Cited by: §1, §2, §3.1, §3.3, §3.3, 2nd item.
  • [4] S. Buch, C. Eyzaguirre, A. Gaidon, J. Wu, L. Fei-Fei, and J. C. Niebles (2022) Revisiting the ”Video” in Video-Language Understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix A, §4.
  • [5] J. Carreira and A. Zisserman (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [6] D. L. Chen and W. B. Dolan (2011) Collecting Highly Parallel Data for Paraphrase Evaluation. In ACL, Cited by: §C.3, §1, Table 8.
  • [7] D. L. Chen and W. B. Dolan (2011) Collecting Highly Parallel Data for Paraphrase Evaluation. In Annual Meetings of the Association for Computational Linguistics (ACL), Cited by: §C.2, Table 15.
  • [8] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [9] N. Dalal and B. Triggs (2005) Histograms of Oriented Gradients for Human Detection. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3, Table 1, Table 6.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a Large-Scale Hierarchical Image Database. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Cited by: §1, §2, §4, footnote 6.
  • [12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference for Learning Representations (ICLR), Cited by: §3.3.
  • [13] Z. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, P. Zhang, L. Yuan, N. Peng, et al. (2022) An empirical study of training end-to-end vision-and-language transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [14] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang (2019)

    Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

    .
    In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [15] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) SlowFast Networks for Video Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [16] T. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu (2021) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling. In arXiv:2111.12681, Cited by: §1, §3.2.
  • [17] V. Gabeur, C. Sun, K. Alahari, and C. Schmid (2020) Multi-modal Transformer for Video Retrieval. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [18] J. Gao, R. Ge, K. Chen, and R. Nevatia (2018) Motion-Appearance Co-Memory Networks for Video Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [19] J. Gao, C. Sun, Z. Yang, and R. Nevatia (2017) TALL: Temporal Activity Localization via Language Query. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [20] Y. Ge, Y. Ge, X. Liu, D. Li, Y. Shan, X. Qie, and P. Luo (2022) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 9, §6.
  • [21] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2022) Masked Autoencoders Are Scalable Vision Learners. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.1, §3.3.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Table 3, §5.
  • [23] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing Moments in Video with Natural Language. In International Conference on Computer Vision (ICCV), Cited by: Figure 4, §C.1, Appendix C, §4, Table 9.
  • [24] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing Moments in Video with Natural Language. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [25] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §C.2, Table 15, Appendix C, §1, §2, §4, Table 8.
  • [26] J. Jiang, Z. Chen, H. Lin, X. Zhao, and Y. Gao (2020) Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [27] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman (2017) The Kinetics Human Action Video Dataset. In arXiv:1705.06950, Cited by: §2, §4.
  • [28] S. Kim, S. Jeong, E. Kim, I. Kang, and N. Kwak (2021) Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [29] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-Captioning Events in Videos. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li (2017) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. In International Journal of Computer Vision (IJCV), Cited by: §2.
  • [31] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference for Learning Representations (ICLR), Cited by: §2.
  • [32] T. M. Le, V. Le, S. Venkatesh, and T. Tran (2020) Hierarchical Conditional Relation Networks for Video Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [33] J. Lei, T. L. Berg, and M. Bansal (2021) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [34] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §C.1, §C.1, §2, §5, Table 8, Table 9, §6.
  • [35] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: Localized, Compositional Video Question Answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • [36] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVQA+: Spatio-Temporal Grounding for Video Question Answering. In Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §2.
  • [37] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [38] D. Li, J. Li, H. Li, J. C. Niebles, and S. C.H. Hoi (2022) Align and Prompt: Video-and-Language Pre-training with Entity Prompts. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5, Table 8, Table 9, §6.
  • [39] L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020) HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Appendix A, §1, §2, §2, Table 9.
  • [40] L. Li, Z. Gan, K. Lin, C. Lin, Z. Liu, C. Liu, and L. Wang (2022) LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling. In arXiv:2206.07160, Cited by: §C.2.
  • [41] L. Li, J. Lei, Z. Gan, L. Yu, Y. Chen, R. Pillai, Y. Cheng, L. Zhou, X. E. Wang, W. Y. Wang, T. L. Berg, M. Bansal, J. Liu, L. Wang, and Z. Liu (2021) VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [42] K. Lin, L. Li, C. Lin, F. Ahmed, Z. Gan, Z. Liu, Y. Lu, and L. Wang (2022) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §C.3, Table 8, §6.
  • [43] S. Liu, H. Fan, S. Qian, Y. Chen, W. Ding, and Z. Wang (2021) HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. In arXiv:2103.15049, Cited by: §2.
  • [44] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman (2020) Use What You Have: Video Retrieval Using Representations From Collaborative Experts. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. In arXiv:1907.11692, Cited by: §2.
  • [46] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In International Conference on Computer Vision (ICCV), Cited by: Table 10, Appendix B, §3.3, Table 1, §4, §4, Table 3, §5, Table 6, Table 7.
  • [47] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu (2022) Video Swin Transformer. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 10, Table 13, §3.3, Table 1, §4, §4.
  • [48] I. Loshchilov and F. Hutter (2019) Decoupled Weight Decay Regularization. In International Conference for Learning Representations (ICLR), Cited by: Appendix B, Appendix C.
  • [49] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li (2021) CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. In arXiv:2104.08860, Cited by: Table 9.
  • [50] Microsoft Azure. Note: https://azure.microsoft.com/ Cited by: Appendix C.
  • [51] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [52] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In International Conference on Computer Vision (ICCV), Cited by: §C.1, §1, §2.
  • [53] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: Appendix B.
  • [54] M. Patrick, P. Huang, Y. Asano, F. Metze, A. Hauptmann, J. Henriques, and A. Vedaldi (2021) Support-set bottlenecks for video-text representation learning. In International Conference for Learning Representations (ICLR), Cited by: §2.
  • [55] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), Cited by: §2, §3.3, Table 1, §4, Table 6.
  • [56] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-Shot Text-to-Image Generation. In International Conference on Machine Learning (ICML), Cited by: §3.3, Table 1, §4, Table 6.
  • [57] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision Transformers for Dense Prediction. In International Conference on Computer Vision (ICCV), Cited by: §3.3, Table 1, §4, Table 6.
  • [58] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Cited by: Appendix C.
  • [59] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele (2015) A Dataset for Movie Description. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 5, §C.1, §C.2, Appendix C, §1, §2, Table 9.
  • [60] A. Rouditchenko, A. Boggust, D. Harwath, B. Chen, D. Joshi, S. Thomas, K. Audhkhasi, H. Kuehne, R. Panda, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, and J. Glass (2021) AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. In INTERSPEECH, Cited by: §2.
  • [61] P. H. Seo, A. Nagrani, A. Arnab, and C. Schmid (2022) End-to-end Generative Pretraining for Multimodal Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 8, §6.
  • [62] P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: Appendix B, §5, Table 6, Table 7, Table 8, §6, §6.
  • [63] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: A Joint Model for Video and Language Representation Learning. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [64] H. Tan and M. Bansal (2019) LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • [65] H. Tan, J. Lei, T. Wolf, and M. Bansal (2021) VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning. In arXiv:2106.11250, Cited by: §2, 2nd item.
  • [66] Z. Teed and J. Deng (2020) RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In European Conference on Computer Vision (ECCV), Cited by: Figure 2, Appendix A, §3.3, Table 1, §4.
  • [67] Z. Tong, Y. Song, J. Wang, and L. Wang (2022) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. In arXiv:2203.12602, Cited by: §2.
  • [68] A. Torabi, N. Tandon, and L. Sigal (2016) Learning Language-Visual Embedding for Movie Understanding with Natural-Language. In arXiv:1609.08124, Cited by: §C.2, Table 15, Appendix C, Table 8.
  • [69] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural Discrete Representation Learning. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §3.3.
  • [70] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [71] A. J. Wang, Y. Ge, R. Yan, Y. Ge, X. Lin, G. Cai, J. Wu, Y. Shan, X. Qie, and M. Z. Shou (2022) All in One: Exploring Unified Video-Language Pre-training. In arXiv:2203.07303, Cited by: Table 8, Table 9.
  • [72] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool (2016) Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [73] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y. Jiang, L. Zhou, and L. Yuan (2022) BEVT: BERT Pretraining of Video Transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [74] X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [75] C. Wei, H. Fan, S. Xie, C. Wu, A. Yuille, and C. Feichtenhofer (2022) Masked Feature Prediction for Self-Supervised Visual Pre-Training. In arXiv:2112.09133, Cited by: §1, §2, §3.3, §3.3, §3.3.
  • [76] L. Wei, L. Xie, W. Zhou, H. Li, and Q. Tian (2022) MVP: Multimodality-guided Visual Pre-training. In arXiv:2203.05175, Cited by: §1, §2.
  • [77] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [78] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu (2022) SimMIM: A Simple Framework for Masked Image Modeling. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.3.
  • [79] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang (2017) Video Question Answering via Gradually Refined Attention over Appearance and Motion. In ACM Multimedia (ACMMM), Cited by: §C.2, §C.2, Table 15, Appendix C, §1, §2, Table 8.
  • [80] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 3, §C.1, §C.2, §C.3, Appendix C, §1, §2, Table 8, Table 9.
  • [81] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid (2021) Just Ask: Learning to Answer Questions from Millions of Narrated Videos. In International Conference on Computer Vision (ICCV), Cited by: §2, Table 8.
  • [82] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura (2020) BERT Representations for Video Question Answering. In Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.
  • [83] Y. Yu, J. Kim, and G. Kim (2018) A Joint Sequence Fusion Model for Video Question Answering and Retrieval. In European Conference on Computer Vision (ECCV), Cited by: §C.2, Table 15, Appendix C, Table 8.
  • [84] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi (2021) MERLOT: Multimodal Neural Script Knowledge Models. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, 3rd item, Table 8.
  • [85] B. Zhang, H. Hu, and F. Sha (2018) Cross-Modal and Hierarchical Modeling of Video and Text. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [86] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong (2022) iBOT: Image BERT Pre-Training with Online Tokenizer. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.1.
  • [87] L. Zhou, C. Xu, and J. J. Corso (2018) Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [88] L. Zhu and Y. Yang (2020) ActBERT: Learning Global-Local Video-Text Representations. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.