In recent years, one of the major research streams has been the pre-training of foundation models (e.g., BERT [BERT], GPT-3 [GPT3], and CLIP [CLIP]) followed by transfer learning to multiple downstream tasks. Following this trend, pre-training of multi-modal representations for videos (e.g., MIL-NCE [mil-nce] and HERO [HERO]) has also been widely studied using large-scale datasets. However, due to the lack of a common benchmark, pre-training algorithms for multi-modal video representations are evaluated on different downstream tasks, which makes them difficult to compare.
Motivated by this, the VALUE benchmark (or challenge) was proposed to measure the generalization ability and versatility of multi-modal pre-trained models for videos. Two characteristics distinguish the VALUE benchmark from other benchmarks. First, it comprehensively measures the generalization ability of models over popular video-and-language understanding tasks, including video retrieval, video question answering, and video captioning, as presented in Figure 1. These three macro-tasks comprise 11 tasks built on six widely used datasets (TV [tvqa], HowTo100M [howto100m], YouCook2 [youcook2], VATEX [vatex], VIOLIN [violin], and VLEP [vlep]) covering a broad range of video genres, lengths, and data volumes. Second, the videos in the VALUE benchmark come with multi-modal inputs, including frames, audio, and textual information. In contrast, most existing works tend to focus only on visual cues. Therefore, the VALUE benchmark deals with multi-modal, multi-task video data.
In this report, we describe our three winning strategies for the VALUE challenge. The first strategy is single model optimization. Since each task behaves differently, using the same training configuration for every task would be sub-optimal. Through extensive experiments, we identify the best-performing training configuration for each task, including the best combination of visual features and fine-tuning strategy. Second, we exploit visual concepts (objects and associated attributes) as an auxiliary source and combine them with the global clip-level visual representations, which allows our model to leverage rich and fine-grained visual information. Finally, we adopt a task-aware ensemble strategy. Since the output format of a model varies depending on the task, we apply ensemble strategies that are specialized for each task. With these three strategies, we ranked first in the VALUE and QA phases.
2 Our Approach
In this section, we introduce our winning solution in detail. Our base model is HERO [HERO], and we begin with the starter code (https://github.com/VALUE-Leaderboard/StarterCode) provided by the competition organizers. Based on this, we focus on the fine-tuning stage for individual tasks and improve the performance through the following three strategies: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble. We select the best model based on the validation score, assuming that the distributions of the validation and test sets are similar. We illustrate the entire process of our pipeline in Figure 2.
2.1 Revisiting VALUE Challenge
The VALUE benchmark is a collection of video-and-language datasets on multi-channel videos (i.e., video and subtitle) across various video domains and genres. The benchmark contains 11 tasks (TVR, How2R, YC2R, VATEX-EN-R, TVQA, How2QA, VIOLIN, VLEP, TVC, YC2C, and VATEX-EN-C), each of which belongs to one of three video-and-language macro-tasks: (a) text-based video retrieval (Retrieval), (b) video question answering (QA), and (c) video captioning (Captioning), as illustrated in Figure 1. In this competition, the raw videos are not provided due to licensing issues. Instead, the competition organizers provide raw text data (e.g., subtitles) and eight different types of clip-level visual features (ResNet+SlowFast, ResNet+MIL-NCE, Clip-ViT+SlowFast, Clip-ViT+MIL-NCE, ResNet, SlowFast, Clip-ViT, MIL-NCE) extracted from different pre-trained models (ResNet [resnet], SlowFast [slowfast], Clip-ViT [clip-vit], MIL-NCE [mil-nce]). Note that the plus (+) mark indicates the concatenation of multiple features.
Under these constraints, we choose the HERO model as our baseline for two reasons. First, the starter code based on HERO provides detailed training configurations for each task. Second, we encountered enormous data loss when downloading raw videos from YouTube for various reasons (e.g., broken URLs, blocked regions, etc.), which makes pre-training any other algorithm from raw videos intractable. HERO follows a two-stage training procedure: (1) self-supervised learning on large-scale data, and (2) supervised fine-tuning on relatively smaller-scale data. Given limited resources and time, we set our objective to improving the fine-tuning stage, because pre-training a single HERO model takes three weeks even with 16 GPUs, as stated in the original paper. However, we believe that the pre-training of HERO could be more important and influential than fine-tuning for achieving a higher score in the competition. We refer interested readers to [HERO] for more details about the structure of HERO and how to apply it to different tasks.
2.2 Single Model Optimization
Assuming our model is already pre-trained, we start by optimizing training configurations to fit each of the target tasks best; the training hyper-parameters for individual tasks are summarized in Table 1. In addition, we search for the best combination of visual features and fine-tuning strategies, which will be described below.
Visual feature adaptation.
As discovered by Li et al. [value], although HERO is pre-trained using the ResNet+SlowFast visual feature, fine-tuning the model with different types of visual features can improve the performance. Choosing the visual feature that works best for the target task is therefore vital for a higher score in the fine-tuning stage. Accordingly, we perform extensive experiments and identify the best-adapted feature out of the eight visual features for each task.
The baseline paper of VALUE [value] explores two different ways of fine-tuning: ST and ATST. Specifically, ST (single-task) means that the HERO model is fine-tuned on the target task only, whereas ATST (all-task to single-task) performs multi-task learning over all eleven tasks followed by further fine-tuning on a single target task. We found that the best strategy is not uniform but differs for each target task. Based on this observation, we exhaustively search over the combinations of visual feature and fine-tuning strategy for each task. The best-performing combinations are shown in Table 2.
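The exhaustive search over visual features and fine-tuning schemes can be sketched as a plain grid search. This is only an illustration under assumptions: `finetune_and_validate` is a hypothetical callable standing in for a full fine-tuning run, and the function names are ours, not part of the starter code.

```python
from itertools import product

# The eight feature types listed in Section 2.1 and the two fine-tuning schemes.
VISUAL_FEATURES = [
    "ResNet+SlowFast", "ResNet+MIL-NCE", "Clip-ViT+SlowFast", "Clip-ViT+MIL-NCE",
    "ResNet", "SlowFast", "Clip-ViT", "MIL-NCE",
]
FINETUNE_SCHEMES = ["ST", "ATST"]


def search_best_config(task, finetune_and_validate):
    """Try every (feature, scheme) pair and keep the best validation score.

    `finetune_and_validate(task, feature, scheme)` is a hypothetical callable
    that fine-tunes a model and returns its validation score (higher is better).
    """
    best_config, best_score = None, float("-inf")
    for feature, scheme in product(VISUAL_FEATURES, FINETUNE_SCHEMES):
        score = finetune_and_validate(task, feature, scheme)
        if score > best_score:
            best_config, best_score = (feature, scheme), score
    return best_config, best_score
```

With eight features and two schemes, this amounts to 16 fine-tuning runs per task before any other hyper-parameter is varied.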
2.3 Transfer Learning with Visual Concepts
Visual concepts are often used in vision-and-language tasks [you2016image, yu2017end] to complement the global-level visual cues obtained using 2D or 3D CNNs (e.g., ResNet [resnet], SlowFast [slowfast], etc.). Inspired by this, we leverage visual concepts during the fine-tuning stage of individual downstream tasks as an additional language source. To extract the visual concepts, we employ VinVL [vinvl], a detection model that provides visual concepts such as objects and attributes, because of its good performance. Given a video, we first sample a frame from each subtitle time interval. Then, VinVL is applied to the sampled frames and provides three visual concept labels for each of up to 10 regions per frame. These extracted visual concept labels are appended to the subtitles and fed to the text embedding network of the HERO model (i.e., a RoBERTa tokenizer followed by a word embedding layer with positional encoding); the visual concepts of individual regions are separated by the [SEP] token. Examples of extracted visual concepts are illustrated in Figure 3.
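As a rough illustration of how concept labels might be attached to a subtitle before tokenization, consider the sketch below. The function name, the label ordering within a region, and the choice to also place a separator between the subtitle and the first region are assumptions made for illustration; only the limits (three labels, at most 10 regions) and the [SEP] separator come from the description above.

```python
SEP_TOKEN = "[SEP]"
MAX_REGIONS = 10        # regions kept per frame, as described in the text
LABELS_PER_REGION = 3   # concept labels kept per region, as described in the text


def attach_visual_concepts(subtitle, regions):
    """Append region-level concept labels to a subtitle string.

    `regions` is a list of label lists, one per detected region, e.g.
    [["red", "round", "apple"], ...]. The subtitle is returned unchanged when
    no concepts are available (e.g. the video could not be downloaded).
    """
    if not regions:
        return subtitle
    chunks = [" ".join(labels[:LABELS_PER_REGION]) for labels in regions[:MAX_REGIONS]]
    # Assumption: one separator between the subtitle and each region chunk.
    return f" {SEP_TOKEN} ".join([subtitle] + chunks)
```

For example, `attach_visual_concepts("now add the salt", [["white", "small", "bowl"]])` yields `"now add the salt [SEP] white small bowl"`, which would then be passed to the tokenizer like any other subtitle.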
2.4 Task-aware Ensemble Strategies
Since the output of a model differs depending on the task, a different ensemble strategy needs to be established for each task. For example, the simplest form of model ensemble, averaging the confidence scores, cannot be applied to the captioning task because a captioning model does not output a confidence score but the predicted caption itself. Therefore, we specialize the ensemble strategies for each task.
Bayesian optimization for retrieval.
Given $N$ retrieval models, we obtain a list of similarity score matrices $\{S_1, \dots, S_N\}$, where $S_i$ is the similarity matrix between the text queries and the candidate videos for the $i$-th retrieval model. (For Video Corpus Moment Retrieval (VCMR), we apply non-maximum suppression (NMS) with an IoU threshold of 0.7 to retrieve at most 100 candidates.) Then, we find a set of weights $\{w_1, \dots, w_N\}$ used to identify the best combination of retrieval models. Finally, the ensembled score matrix is given by
$$\hat{S} = \sum_{i=1}^{N} w_i S_i,$$
where $\sum_{i=1}^{N} w_i = 1$. To find the optimal values for $w_i$, we resort to hyper-parameter tuning by Bayesian optimization (https://github.com/hyperopt/hyperopt). To be specific, given that $\hat{S}$ is obtained from the predictions of $N$ models, we use the algorithm of [bergstra2011algorithms] to determine the optimal values for $w_i$ based on the objective function. We use the mean recall (i.e., (R@1+R@5+R@10)/3) as the objective function and iterate this optimization process over 300 steps.
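The weighted score ensemble and its mean-recall objective can be sketched in plain Python as follows. Note the substitution: to keep the example dependency-free, the weight search here is a simple random search over normalized weights, not the Bayesian optimization used in the report. All function names, and the assumption that each query has exactly one ground-truth video, are ours.

```python
import random


def ensemble_scores(score_matrices, weights):
    """Weighted sum of per-model similarity matrices (queries x videos)."""
    n_queries, n_videos = len(score_matrices[0]), len(score_matrices[0][0])
    return [
        [sum(w * m[q][v] for w, m in zip(weights, score_matrices)) for v in range(n_videos)]
        for q in range(n_queries)
    ]


def mean_recall(scores, gt_index, ks=(1, 5, 10)):
    """(R@1 + R@5 + R@10) / 3, where gt_index[q] is the correct video for query q."""
    recalls = []
    for k in ks:
        hits = 0
        for q, row in enumerate(scores):
            top_k = sorted(range(len(row)), key=lambda v: row[v], reverse=True)[:k]
            hits += gt_index[q] in top_k
        recalls.append(hits / len(scores))
    return sum(recalls) / len(recalls)


def search_weights(score_matrices, gt_index, steps=300, seed=0):
    """Random search over weights summing to 1 (a stand-in for Bayesian optimization)."""
    rng = random.Random(seed)
    n = len(score_matrices)
    best_w = [1.0 / n] * n  # start from the uniform average
    best_obj = mean_recall(ensemble_scores(score_matrices, best_w), gt_index)
    for _ in range(steps):
        raw = [rng.random() + 1e-9 for _ in range(n)]
        total = sum(raw)
        w = [r / total for r in raw]  # normalize so the weights sum to 1
        obj = mean_recall(ensemble_scores(score_matrices, w), gt_index)
        if obj > best_obj:
            best_w, best_obj = w, obj
    return best_w, best_obj
```

In practice the objective would be evaluated on the validation set, and the random proposal step would be replaced by the optimizer's own suggestion mechanism.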
Training single-layered FC for QA.
Given $N$ QA models, our objective is to find the best set of weights $\{w_1, \dots, w_N\}$. Basically, the QA task in VALUE is identical to a simple classification that predicts an answer label $y \in \{1, \dots, C\}$, where $C$ is the number of candidate answers. In this formulation, the QA models output the confidence scores $\{P_1, \dots, P_N\}$, where $P_i \in \mathbb{R}^{M \times C}$ is the score matrix representing the confidence scores for each class output by the $i$-th model given $M$ examples of the test set. Instead of using Bayesian optimization, we use a learning-based approach for QA. We convert the problem of finding the optimal weights into learning a single fully-connected (FC) layer with no bias. First, we collect all model outputs and batchify them to build an input $X \in \mathbb{R}^{B \times C \times N}$, where $B$ is the batch size. Then we train a linear layer (i.e., nn.Linear($N$, 1, bias=False)) to obtain the output $\hat{Y} \in \mathbb{R}^{B \times C}$. Lastly, we apply the cross-entropy loss with the ground-truth labels. The strength of the learning-based approach is that it converges fast regardless of how large $N$ is. On the other hand, its disadvantage is that it requires careful hyperparameter tuning for training.
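Written out explicitly, the bias-free linear layer and its loss take the following form. The notation is assumed here for illustration: $N$ models, $C$ candidate answers, a batch of $B$ examples, an input tensor $X$ of shape $B \times C \times N$ stacking the per-model confidence scores, learned weights $w_i$, and ground-truth labels $y_b$.

```latex
\hat{Y}_{b,c} = \sum_{i=1}^{N} w_i \, X_{b,c,i},
\qquad
\mathcal{L} = -\frac{1}{B} \sum_{b=1}^{B}
\log \frac{\exp\big(\hat{Y}_{b,y_b}\big)}{\sum_{c=1}^{C} \exp\big(\hat{Y}_{b,c}\big)}
```

The same $N$ weights are shared across all candidate answers, so the layer learns only one scalar per model; this is what makes it a learned counterpart of a weighted score average.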
Consensus-based ranking for captioning.
Given $N$ captioning models, we generate a set of captions $\{c_1, \dots, c_N\}$ for an input video, where each model generates its caption with a greedy decoding strategy. Then, we compute a consensus score $s_i$ for the caption $c_i$ from the $i$-th captioning model, defined as the average similarity to all the other captions:
$$s_i = \frac{1}{N-1} \sum_{j \neq i} \mathrm{sim}(c_i, c_j),$$
where $\mathrm{sim}(c_i, c_j)$ is the similarity between two captions $c_i$ and $c_j$. We employ five sentence embedding models (https://github.com/UKPLab/sentence-transformers; specifically, paraphrase-mpnet-base-v2, stsb-mpnet-base-v2, paraphrase-MiniLM-L3-v2, paraphrase-multilingual-mpnet-base-v2, and paraphrase-TinyBERT-L6-v2) and use the average of the cosine similarities of their embeddings as the similarity function. Finally, the caption with the highest consensus score is chosen as our final output.
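The consensus-based selection can be sketched as below. To keep the example self-contained, the similarity function is a placeholder (token-level Jaccard overlap) rather than the averaged cosine similarity of sentence embeddings described above; any function mapping two captions to a score can be passed in its place. The function names are hypothetical.

```python
def token_jaccard(a, b):
    """Placeholder similarity: Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def consensus_scores(captions, sim=token_jaccard):
    """Average similarity of each caption to all the other captions."""
    n = len(captions)
    if n < 2:
        return [0.0] * n
    return [
        sum(sim(captions[i], captions[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def select_caption(captions, sim=token_jaccard):
    """Return the caption with the highest consensus score (first one on ties)."""
    scores = consensus_scores(captions, sim)
    best = max(range(len(captions)), key=lambda i: scores[i])
    return captions[best]
```

The effect is that an outlier caption, which agrees with few of the others, receives a low score, while a caption close to the majority is selected.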
This section discusses the impact of the single model optimization, fine-tuning with visual concepts on individual tasks, and the task-aware ensemble strategies.
3.1 Single Model Performance
As discussed in Section 2.2, we optimize the single model training configuration for each of the 11 tasks individually. The best configurations (i.e., usage of visual concepts, application of AT and ST, and the choice of visual feature) and the corresponding best single model performance are summarized in Table 2. We make the following observations. First, our optimized models outperform their VALUE baseline counterparts by large margins in most cases, while some of them show slightly lower scores (see M1, M3, M8, and M11). Second, the best combinations of (1) visual feature adaptation during the fine-tuning stage and (2) fine-tuning scheme (i.e., ST or ATST) are highly dependent on both domain and task. In general, using the Clip-ViT+SlowFast (CS) visual feature with the ATST scheme shows outstanding results. On the other hand, in some cases, a specific type of feature significantly outperforms the others. For example, we found that fine-tuning on the single task with the MIL-NCE feature performs best for the YouCook2 dataset (see M3 and M10); this is expected because the videos in YouCook2 and HowTo100M (used for pre-training MIL-NCE) share the same domain, implying the importance of in-domain pre-training. Lastly, the task-aware model ensemble provides a further performance gain in all tasks except VATEX-EN-R in Retrieval. In addition, Table 3 shows the impact of visual feature adaptation, which provides performance gains in the Retrieval and Captioning macro-tasks (see P1-4 and P9-11). The results from the two tables indicate the importance of configuration optimization for individual tasks.
3.2 Effect of Visual Concepts
As illustrated in Figure 3 and described in Section 2.3, we extract the visual concepts from raw frames and leverage them together with the corresponding subtitles as auxiliary information. Since many videos failed to download, only a portion of the videos in each dataset domain (i.e., YouCook2 (89%), VIOLIN (15%), VATEX (76%), HowTo100M (87%), and TV (99%)) is used for the extraction. For training samples from the missing videos, we use the subtitles only, without the visual concepts. Table 4 analyzes the effect of using the visual concepts during fine-tuning. We observe a performance improvement on five out of eleven tasks (see G4-6, G9, and G11) compared to fine-tuning without the visual concepts. We found that fine-tuning with the visual concepts is especially effective for VATEX videos (both retrieval and captioning; G4 and G11) as well as for the QA macro-task (G5-8). On the other hand, fine-tuning with visual concepts did not improve the performance on the remaining retrieval tasks. Although the effectiveness of visual concepts depends on many experimental variables (e.g., domains, tasks, number of visual concepts, etc.), it helps to increase the variance between models, which is known to be an essential factor for an effective model ensemble.
3.3 Model Ensemble
Our submission scores in Table 5 are obtained by model ensemble. Given fine-tuned models trained with various training configurations, we choose the top-$K$ models sorted by validation score for the ensemble. Note that we vary the hyperparameter $K$ according to the macro-task because we found a trade-off between computational cost and ensemble performance. We set $K$ to 8, 16, and 32 for the captioning, QA, and retrieval macro-tasks, respectively. Table 2 shows the validation scores of our ensemble models. In most cases, the model ensemble improves the score by a large margin (max. +39.55% and avg. +7.45%). The retrieval task benefits the most from the ensemble, followed by captioning and QA.
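The model selection step for the ensemble can be sketched as follows. The ensemble sizes are taken from the text above; the dictionary keys, the function name, and the mapping from model names to validation scores are hypothetical.

```python
# Ensemble sizes per macro-task, as stated in the text.
TOP_K = {"captioning": 8, "qa": 16, "retrieval": 32}


def select_top_k(val_scores, macro_task):
    """Return the names of the K best models by validation score (higher is better).

    `val_scores` maps a (hypothetical) model name to its validation score.
    If fewer than K models are available, all of them are returned.
    """
    k = TOP_K[macro_task]
    ranked = sorted(val_scores, key=val_scores.get, reverse=True)
    return ranked[:k]
```

The selected models would then be passed to the ensemble strategy that matches the macro-task.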
We described our winning strategies for the VALUE challenge 2021. To solve the tasks, we proposed three key ingredients: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble. With the proposed strategies, the score improves step by step, as shown by the extensive experiments in this report. Our final submission is an ensemble of the best-performing single models on the validation set, trained with various training configurations. Based on our approach, we achieved first place in the VALUE and QA phases of the competition.