Multimodal Pretraining for Dense Video Captioning

by   Gabriel Huang, et al.
Université de Montréal

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.





1 Introduction

YouTube recently reported that a billion hours of videos were being watched on the platform every day (youtubeblog). In addition, the amount of time people spent watching online videos was estimated to grow at an average rate of 32% a year between 2013 and 2018, with an average person forecasted to watch 100 minutes of online videos per day in 2021.


An important reason for this fast-growing video consumption is information-seeking. For instance, people turn to YouTube “hungry for how-to and learning content” oneilhart2018. Indeed, compared to traditional content formats such as text, video carries richer information to satisfy such needs. But as a medium, videos are also inherently more difficult to skim through, making it harder to quickly target the relevant part(s) of a video. Recognizing this difficulty, search engines started showing links to “key moments” within videos in search results, based on timestamps and short descriptions provided by the content creators themselves.

This enables users to get a quick sense of what the video covers, and also to jump to a particular time in the video if so desired. This effort echoes prior work in the literature showing how users of instructional videos can benefit from human-curated meta-data, such as a timeline pointing to the successive steps of a tutorial kim2014crowdsourcing; margulieux2012subgoal; weir2015learnersourcing. Producing such meta-data in an automatic way would greatly scale up the efforts of providing easier information access to videos. This task is closely related to the dense video captioning task considered in prior work zhou2018towards; Zhou2018EndtoEndDV; krishna2017densecaptioning, where an instructional video is first segmented into its main steps, followed by segment-level caption generation.

Groundtruth Varying stiching speeds
Ø-Pretraining Showing other parts
MASS-Pretraining Explaining how to do a stitch
Figure 1: Dense video captioning using ViTT–trained models. For the given video scene, we show the ViTT annotation (Groundtruth) and model outputs (no pretraining and MASS-based pretraining).

To date, the YouCook2 data set zhou2018towards is the largest annotated data set for dense video captioning. It contains annotations for 2,000 cooking videos covering 89 recipes, with a per-recipe training / validation split. Restricting to a small number of recipes is helpful for early exploratory work, but such restrictions impose barriers to model generalization and adoption that are hard to overcome. We directly address this problem by constructing a larger and broader-coverage annotated dataset that covers a wide range of instructional topics (cooking, repairs, maintenance, etc.). We make the results of our annotation efforts publicly available as Video Timeline Tags (ViTT), consisting of around 8,000 videos annotated with timelines (on average 7.1 segments per video, each segment with a short free-text description).

Using YouCook2 and the new ViTT dataset as benchmarks for testing model performance and generalization, we further focus on the sub-problem of video-segment–level caption generation, assuming segment boundaries are given hessel2019case; sun2019videobert; luo2020univilm. Motivated by the high cost of collecting human annotations, we investigate pretraining a video segment captioning model using unsupervised signals: ASR (Automatic Speech Recognition) tokens and visual features from instructional videos, and unpaired instruction steps extracted from independent sources: Recipe1M marin2019recipe1m+ and WikiHow koupaee2018wikihow. In contrast to prior work that focused on BERT-style pretraining of encoder networks (sun2019videobert; sun2019contrastive), our approach entails jointly pretraining both multimodal encoder and text-based decoder models via MASS-style pretraining song2019mass. Our experiments show that pretraining with either text-only or multi-modal data provides significant gains over no pretraining, on both the established YouCook2 benchmark and the new ViTT benchmark. The results we obtain establish state-of-the-art performance on YouCook2, and present strong performance numbers on the ViTT benchmark. These findings help us conclude that the resulting models generalize well and are quite robust over a wide variety of instructional videos.

2 Related Work

Text-only Pretraining.

Language pretraining models based on the Transformer neural network architecture Vaswani2017AttentionIA such as BERT (devlin2018bert), GPT GPT2018, RoBERTa Roberta, MASS song2019mass and ALBERT Lan2020ALBERTAL have achieved state-of-the-art results on many NLP tasks. MASS (song2019mass) has been recently proposed as a joint encoder-decoder pretraining strategy. For sequence-to-sequence tasks, this strategy is shown to outperform approaches that separately pretrain the encoder (using a BERT-style objective) and the decoder (using a language modeling objective). UniLM UniLM, BART Lewis2019BARTDS, and T5 Raffel2019ExploringTL propose unified pretraining approaches for both understanding and generation tasks.

Multimodal Pretraining.

VideoBERT sun2019videobert, CBT sun2019contrastive and ActBERT actbert2020cvpr use a BERT-style objective to train both video and ASR text encoders. Alayrac2016cvpr and Miech2020cvpr use margin-based loss functions to learn joint representations for video and ASR, and evaluate them on downstream tasks such as video captioning, action segmentation and anticipation, and action localization. An independent and concurrent work (UniViLM) by luo2020univilm is closely related to ours in that we share some similar pretraining objectives, some of the pretraining setup (HowTo100M Alayrac2016cvpr), and the downstream video captioning benchmark using YouCook2 zhou2018towards. The main difference is that they use BERT-style pretraining for the encoder and language-modeling style pretraining for the decoder, whereas we use MASS-style pretraining to pretrain encoder and decoder jointly.

Other approaches such as ViLBERT (lu2019vilbert), LXMERT (tan2019lxmert), Unicoder-VL (li2019unicoder), VL-BERT (su2019vl), and UNITER (chen2019uniter) focus on pretraining joint representations for text and image, evaluating them on downstream tasks such as visual question answering, image-text retrieval and referring expressions.

Dense Video Captioning.

In this paper, we focus on generating captions at the segment-level, which is a sub-task of the so-called dense video captioning task krishna2017densecaptioning, where fine-grained captions are generated for video segments, conditioned on an input video with pre-defined event segments. This is different from the video captioning models that generate a single summary for the entire video Wang2019VaTeXAL.

hessel2019case make use of ASR and video for segment-level captioning on YouCook2 and show that most of the performance comes from ASR. shi2019dense; luo2020univilm train their dense video captioning models on both video frames and ASR text and demonstrate the benefits of adding ASR as an input to the model. There are also a number of video captioning approaches that do not use ASR directly Zhou2018EndtoEndDV; pan2020cvpr; zheng2020cvpr; zhang2020cvpr; Lei2020MARTMR.

Instructional video captioning data sets.

In addition to YouCook2 zhou2018towards, there are two other smaller data sets in the instructional video captioning category. Epic Kitchen DamenECCV2018 features 55 hours of video consisting of 11.5M frames, which were densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. How2 how2-dataset consists of instructional videos with video-level (as opposed to segment-level) descriptions, authored by the video creators themselves.

3 Data

We present the datasets used for pretraining, finetuning, and evaluation in Table 1. We also describe in detail the newly introduced dense video captioning dataset, Video Timeline Tags (ViTT).

| Name | Type | # segments |
|---|---|---|
| Pretraining datasets | | |
| YT8M-cook | ASR+video | 186 K |
| HowTo100M | ASR+video | 8.0 M |
| Recipe1M | CAP-style | 10.8 M |
| WikiHow | CAP-style | 1.3 M |
| Finetuning datasets | | |
| YouCook2 | ASR+video+CAP | 11.5 K |
| ViTT-All | ASR+video+CAP | 88.5 K |

Table 1: Datasets used in this work, along with size of the data measured by the total number of segments.

3.1 Dense Video-Captioning Datasets

Our goal is to generate captions (CAP) for video segments. We consider two datasets with segment-level captions for fine-tuning and evaluating ASR+Video→CAP models.


Up to this point, YouCook2 (zhou2018towards) has been the largest human-annotated dense-captioning dataset of instructional videos publicly available. It originally contained 2,000 cooking videos from YouTube. Starting from 110 recipe types (e.g., “shrimp tempura”), 25 unique videos per recipe type were collected; the recipe types that did not gather enough videos were dropped, resulting in a total of 89 recipe types in the final dataset. In addition, youcook2dataset “randomly split the videos belonging to each recipe into 67%:23%:10% as training, validation and test sets,” which effectively guarantees that videos in the validation and test sets are never about unseen recipes. (Note that no annotations are provided for the test split; we conducted our own training/dev/test split over available videos.) Annotators were then asked to construct recipe steps for each video, that is, identify the start and end times for each step, and provide a recipe-like description of each step. Overall, they reported an average of 7.7 segments per video, and 8.8 words per description. After removing videos that had been deleted by users, we obtained a total of 11,549 segments.


One limitation of the YouCook2 dataset is the artificially imposed (almost) uniform distribution of videos over 89 recipes. While this may help make the task more tractable, it is difficult to judge whether performance on its validation / test sets generalizes to unseen topics.

The design of our ViTT dataset annotation process is aimed at fixing some of these drawbacks. We started by collecting a large dataset of videos containing a broader variety of topics to better reflect topic distribution in the wild. Specifically, we randomly sampled instructional videos from the YouTube-8M dataset (abu2016youtube), a large-scale collection of YouTube videos that also contain topical labels. Since much of prior work in this area revolved around cooking videos, we aimed at sampling a significant proportion of our data from videos with cooking labels (specifically, “Cooking” and “Recipe”). Aside from the intentional bias regarding cooking videos, the rest of the videos were selected by randomly sampling non-cooking videos, including only those that were considered to be instructional videos by our human annotators.

Once candidate videos were identified, timeline annotations and descriptive tags were collected. Our motivation was to enable downstream applications to allow navigating to specific content sections. Therefore, annotators were asked to identify the main steps in a video and mark their start time. They were also asked to produce a descriptive-yet-concise, free-text tag for each step (e.g., “shaping the cookies”, “removing any leftover glass”). A subset of the videos has received more than one complete annotation (main steps plus tags).

The resulting ViTT dataset consists of a total of 8,169 videos, of which 3,381 are cooking-related. A total of 5,840 videos have received only one annotation, and have been designated as the training split. Videos with more than one annotation have been designated as validation / test data. Overall, there are 7.1 segments per video on average (max: 19). Given the dataset design, descriptions are much shorter than in YouCook2: on average there are 2.97 words per tag (max: 16); 20% of the captions are single-word, 22% are two words, and 25% are three words. This is not surprising given our motivation of providing short and concise timeline tags for video navigation. We standardized the paraphrases among the top-20 most frequent captions. For instance, {“intro”, “introduction”} → “intro”. Otherwise, we have preserved the original tags as-is, even though additional paraphrasing most definitely exists. Annotators were instructed to start and end the video with an opening and closing segment whenever possible. As a result, the most frequent tag (post-standardization) in the dataset is “intro”, which accounts for roughly 11% of the 88,455 segments. More details on the data collection process and additional analysis can be found in the Supplementary Material (Section A.1).

Overall, this results in 56,027 unique tags, with a vocabulary size of 12,509 token types over 88,455 segments. In this paper, we consider two variants: the full dataset (ViTT-All), and the cooking subset (ViTT-Cooking).

3.2 Pretraining Datasets: ASR+Video

We consider two large-scale unannotated video datasets for pretraining, as described below. Time-stamped ASR tokens were obtained via the YouTube Data API, and split into ASR segments if the timestamps of two consecutive words are more than 2 seconds apart, or if a segment is longer than a pre-specified max length (in our case, 320 words). They were paired with concurrent video frames in the same segment.
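The segmentation heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_asr_segments`, the `(token, start_time)` input format, and the choice to measure the gap between word start times are hypothetical and are not taken from the released pipeline.

```python
from typing import List, Tuple

Word = Tuple[str, float]  # (token, start time in seconds); assumed input format


def split_asr_segments(
    words: List[Word], max_gap_s: float = 2.0, max_len: int = 320
) -> List[List[Word]]:
    """Split time-stamped ASR words into segments.

    A new segment starts when two consecutive words are more than
    `max_gap_s` seconds apart, or when the current segment already
    holds `max_len` words.
    """
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current:
            gap = word[1] - current[-1][1]
            if gap > max_gap_s or len(current) >= max_len:
                segments.append(current)
                current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


words = [("add", 0.0), ("the", 0.4), ("flour", 0.9), ("now", 4.0), ("stir", 4.5)]
segments = split_asr_segments(words)
tokens_per_segment = [[token for token, _ in seg] for seg in segments]
```

Here the 3.1-second pause between "flour" and "now" opens a second segment, so `tokens_per_segment` holds two lists.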


YT8M-cook.

We extract the cooking subset of YouTube-8M (abu2016youtube) by taking, from its training split, videos with “Cooking” or “Recipe” labels, and retain those with English ASR, subject to YouTube policies. After preprocessing, we obtain 186K ASR+video segments with an average length of 64 words (24 seconds) per segment.


HowTo100M.

This dataset is based on the 1.2M YouTube instructional videos released by miech2019howto100m, covering a broad range of topics. After preprocessing, we obtain 7.99M ASR+video segments with an average of 78 words (28.7 seconds) per segment.

3.3 Pretraining Datasets: CAP-style

We also consider two text-only datasets for pretraining, containing unpaired instruction steps similar in style to the target captions.


Recipe1M is a collection of 1M recipes scraped from a number of popular cooking websites (marin2019recipe1m+). We use the sequence of instructions extracted for each recipe in this dataset, and treat each recipe step as a separate example during pretraining. This results in 10,767,594 CAP-style segments, with 12.8 words per segment.


WikiHow is a collection of 230,843 articles extracted from the WikiHow knowledge base (koupaee2018wikihow). Each article comes with a title starting with “How to”. Each associated step starts with a step summary (in bold) followed by a detailed explanation. We extract all the step summaries, resulting in 1,360,145 CAP-style segments, with 8.2 words per segment. Again, each instruction step is considered as a separate example during pretraining.

3.4 Differences between Pretraining and Finetuning Datasets

First, note that video segments are defined differently for pretraining and finetuning datasets, and may not match exactly. For ASR+Video pretraining datasets, which are unsupervised, the segments are divided following a simple heuristic (e.g., two consecutive words more than 2 seconds apart), whereas for finetuning ASR+Video→CAP datasets, which are supervised, the segments are defined by human annotators to correspond to instruction steps. Otherwise, the ASR data are relatively similar between pretraining and finetuning datasets, as both come from instructional videos and are in the style of spoken language.

Second, compared to the target captions in finetuning datasets, the CAP-like pretraining datasets are similar in spirit — they all represent summaries of steps, but they may differ in length, style and granularity. In particular, the CAP-like pretraining datasets are closer in style to captions in YouCook2, where annotators were instructed to produce a recipe-like description for each step. This is reflected in their similar average length (YouCook2: 8.8 words, Recipe1M: 12.8 words, WikiHow: 8.2 words); whereas captions in ViTT are significantly shorter (2.97 words on average).

Despite these differences — some are inevitable due to the unsupervised nature of pretraining datasets — the pretraining data is very helpful for our task as shown in the experimental results.

4 Method

To model segment-level caption generation, we adopt MASS-style pretraining song2019mass with Transformer vaswani2017attention as the backbone architecture. For both pre-training and fine-tuning objectives, we have considered two variants: text-only and multi-modal. They are summarized in Table 2 and more details are given below.

4.1 Separate-Modality Architecture

Figure 2: A diagram for the separate-modality architecture. It consists of a two-stream (text and video inputs) encoder with cross-modal attention and a text-only decoder, jointly trained using the MASS objective.

Both ASR tokens and video segment features are given as input in the multimodal variants. We consider an architecture with a separate transformer for each modality (text or video), see Figure 2 for details. When available, the text and video encoders attend to each other at every layer using cross-modal attention, as in ViLBERT lu2019vilbert. The text decoder attends over the final-layer output of both encoders. We discuss in more detail the differences between using a separate-modality architecture vs. a vanilla-Transformer approach for all modalities in Appendix A.2.

The input to the text encoder is the sum of three components: text token embeddings, positional embeddings and the corresponding style embeddings, depending on the style of the text (ASR or Caption-like). This is similar to the way language-ID embeddings are used in machine translation. The inputs to the video encoder could be either precomputed frame-level 2D CNN features or 3D CNN features, pretrained on the Kinetics Carreira2017QuoVA; Kay2017TheKH data set. The visual features are projected with fully-connected layers to the same dimension as the text embeddings.
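To make the three-way sum concrete, here is a toy sketch in plain Python. The table sizes, the random initialization, and the names `make_table` and `embed_text` are hypothetical; the actual model uses learned 128-dimensional embeddings inside a Transformer, which this sketch does not attempt to reproduce.

```python
import random


def make_table(num_rows, dim, seed):
    """Build a small random embedding table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


DIM = 8  # toy size, chosen only for readability
VOCAB_SIZE, MAX_POSITIONS = 100, 16
STYLES = {"asr": 0, "cap": 1}  # one style embedding per text style

token_table = make_table(VOCAB_SIZE, DIM, seed=0)
position_table = make_table(MAX_POSITIONS, DIM, seed=1)
style_table = make_table(len(STYLES), DIM, seed=2)


def embed_text(token_ids, style):
    """Sum token, positional and style embeddings for each input position."""
    style_vec = style_table[STYLES[style]]
    return [
        [t + p + s for t, p, s in zip(token_table[tok], position_table[pos], style_vec)]
        for pos, tok in enumerate(token_ids)
    ]


asr_inputs = embed_text([5, 17, 42], "asr")
cap_inputs = embed_text([5, 17, 42], "cap")
```

The same token sequence yields different encoder inputs under the two styles, and the two differ exactly by the difference between the style vectors, which is how the encoder can tell ASR from caption-like text.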

The main architecture we consider is a 2-layer encoder (E2), 6-layer decoder (D6) Transformer. We use E2D6 to refer to the text-only version, and E2vidD6 to refer to the multimodal version with an active video encoder. We also experiment with E2D2 and E2vidD2 (2-layer decoder). (We found in a preliminary study that using 6-layer encoders did not improve performance for our application.)

4.2 Pretraining with Text-only MASS

Text-only pretraining is essentially the unsupervised learning of the style transfer between ASR-style and caption-style texts using unpaired data sources: ASR strings from video segments in YT8M-cook or HowTo100M; and CAP-style instruction steps found in Recipe1M or WikiHow. Just as using MASS for unsupervised machine translation involves pretraining the model on unpaired monolingual datasets, we alternate between asr→asr and cap→cap MASS steps during our pretraining stage, which does not require the “source” (ASR) and “target” (CAP-style) data to be aligned.

In an asr→asr step, we mask a random subsequence of the ASR and feed the masked ASR to the text encoder. The text decoder must reconstruct the hidden subsequence while attending to the encoder output. A cap→cap step works similarly by trying to reconstruct a masked sequence of a CAP-style text. The encoder and decoder are trained jointly using teacher-forcing on the decoder. We denote this text-only strategy as MASS in the experiments.
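A single masked-reconstruction step can be illustrated as follows. This is a minimal sketch: the 50% mask ratio, the `[MASK]` symbol, and the function name `mass_example` are assumptions made for the example, since the text above does not specify them.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the real vocabulary entry is an assumption


def mass_example(tokens, mask_ratio=0.5, rng=None):
    """Build one MASS-style example from a token sequence.

    A contiguous span is hidden from the encoder; the decoder must
    reconstruct it. With teacher forcing, the decoder input is the
    target span shifted right by one position.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    rng = rng or random.Random()
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = rng.randint(0, len(tokens) - span_len)  # inclusive bounds
    end = start + span_len
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[end:]
    target = tokens[start:end]
    decoder_input = [MASK] + target[:-1]
    return encoder_input, decoder_input, target


asr = "so now we are going to chop the onions".split()
encoder_input, decoder_input, target = mass_example(asr, rng=random.Random(0))
```

The same function applies unchanged to a CAP-style step, since both step types only differ in the style of the text that is masked and reconstructed.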

4.3 Pretraining with Multimodal MASS

During multimodal pretraining, we alternate between text-only cap→cap MASS steps and multimodal MASS steps. During each multimodal MASS step asr+video→asr, we feed a masked ASR to the text encoder and the co-occurring video features to the video encoder. The text decoder must reconstruct the masked ASR subsequence. We denote this pretraining strategy as MASSvid in the experiments. This trains cross-modal attention between the text encoder and video encoder at every layer, jointly with the text decoder that attends to the output layer of both the text and video encoders. (In preliminary experiments, we had attempted to directly adapt the MASS objective (song2019mass) to video reconstruction, by masking a subsequence of the input video and making the video decoder reconstruct the input using the Noise Contrastive Estimation loss (sun2019contrastive). Due to limited success, we did not further pursue this approach.)

To force more cross-modal attention between encoder and decoder, we also investigate a strategy of hiding the text-encoder output from the decoder for some fraction of training examples. We refer to this strategy as MASSdrop in the experiments.
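One way to picture this strategy is as a per-example choice of which encoder states the decoder may attend to. The sketch below is hypothetical: it assumes the decoder normally attends over the concatenation of text and video encoder outputs, and the name `decoder_memory` and the `drop_text_prob` parameter are introduced only for illustration.

```python
import random


def decoder_memory(text_states, video_states, drop_text_prob, rng):
    """Return the encoder states visible to the decoder for one example.

    With probability `drop_text_prob` the text-encoder output is hidden,
    so the decoder can only rely on the video-encoder output.
    """
    if rng.random() < drop_text_prob:
        return list(video_states)
    return list(text_states) + list(video_states)


rng = random.Random(0)
text_states = ["t0", "t1", "t2"]
video_states = ["v0", "v1"]
always_dropped = decoder_memory(text_states, video_states, 1.0, rng)
never_dropped = decoder_memory(text_states, video_states, 0.0, rng)
```

Because the video encoder still attends to the masked ASR through cross-modal attention, hiding the text-encoder output does not remove the textual context entirely; it only forces that context to pass through the video stream.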

4.4 Pretraining with Alignment and Ordering Tasks

We also explore encoder-only multimodal pretraining strategies. We take the last-layer representation for the CLS (beginning of sentence) token from the encoder, and add a multi-layer perceptron on top of it for binary predictions (Figure 2). Given a pair of ASR and video segment, we train the encoder to predict the following objectives:

  • Segment-Level Alignment. An (ASR, video) pair is aligned if they occur in the same pretraining segment; negative examples are constructed by sampling pairs from the same video but at least 2 segments away.

  • Segment-Level Ordering. We sample (ASR, video) pairs that are at least 2 segments away, and train the model to predict whether the ASR occurs before or after the video clip.

During this MASSalign pretraining stage, we alternate between two text-only MASS steps (cap→cap, asr→asr) and the two binary predictions (Alignment and Ordering) described above.
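The construction of training pairs for the two binary tasks can be sketched as below. The function names, the 50/50 ratio of positive to negative alignment pairs, and the label convention (1 for aligned, 1 for "ASR before video") are assumptions; only the "at least 2 segments away" constraint comes from the description above.

```python
import random


def distant_pairs(num_segments, min_gap=2):
    """All (asr_idx, video_idx) pairs from one video at least `min_gap` apart."""
    pairs = [
        (i, j)
        for i in range(num_segments)
        for j in range(num_segments)
        if abs(i - j) >= min_gap
    ]
    if not pairs:
        raise ValueError("video has too few segments for this task")
    return pairs


def sample_alignment_pair(num_segments, rng, min_gap=2):
    """Return (asr_idx, video_idx, label); label 1 means same segment."""
    if rng.random() < 0.5:  # assumed balance of positives and negatives
        idx = rng.randrange(num_segments)
        return idx, idx, 1
    asr_idx, video_idx = rng.choice(distant_pairs(num_segments, min_gap))
    return asr_idx, video_idx, 0


def sample_ordering_pair(num_segments, rng, min_gap=2):
    """Return (asr_idx, video_idx, label); label 1 means ASR comes first."""
    asr_idx, video_idx = rng.choice(distant_pairs(num_segments, min_gap))
    return asr_idx, video_idx, int(asr_idx < video_idx)


rng = random.Random(0)
alignment_samples = [sample_alignment_pair(6, rng) for _ in range(200)]
ordering_samples = [sample_ordering_pair(6, rng) for _ in range(200)]
```

Sampling negatives from the same video, rather than from other videos, presumably makes the task harder, since topic alone no longer separates aligned from misaligned pairs.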

4.5 Finetuning on Video Captioning

For text-only finetuning, we feed ASR to the text encoder and the decoder has to predict the corresponding CAP (asr→cap). For multimodal finetuning, we also feed additional video representations to the video encoder (asr+video→cap). When finetuning a multimodal model from text-only pretraining, everything related to video (weights in the video encoder and any cross-modal attention modules) will be initialized randomly. In addition to this unidirectional (UniD) finetuning, we also experiment with several variants of bidirectional (BiD) finetuning (Table 2), for instance, adding cap→asr (predicting ASR from CAP) to text-only finetuning. In the experiments, we find some variants of bidirectional finetuning beneficial whether training from scratch or finetuning from a pretrained model.

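| Name | T | V | Input→Output |
|---|---|---|---|
| Pretraining Objectives | | | |
| MASS | ✓ | | CAP→CAP, ASR→ASR |
| MASSvid | ✓ | ✓ | CAP→CAP, ASR+video→ASR |
| MASSdrop | ✓ | ✓ | CAP→CAP, ASR+video→ASR |
| MASSalign | ✓ | ✓ | CAP→CAP, ASR→ASR, ASR+video (alignment and ordering, Section 4.4) |
| Finetuning Objectives | | | |
| UniD | ✓ | | ASR→CAP |
| BiD | ✓ | | ASR→CAP, CAP→ASR |
| UniD | ✓ | ✓ | ASR+video→CAP |
| BiD | ✓ | ✓ | ASR+video→CAP, CAP→ASR |
| BiDalt | ✓ | ✓ | ASR+video→CAP, CAP+video→ASR |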

Table 2: Pretraining and Fine-tuning objectives. For each strategy, ✓  indicates whether the text (T) and video (V) encoders are active, followed by a summary of training objectives involved in one training step.
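The finetuning variants listed in Table 2 differ only in which (input, target) pairs make up one training iteration. The sketch below spells this out; the dictionary layout and the function name `finetuning_steps` are hypothetical and stand in for whatever batching the actual implementation uses.

```python
def finetuning_steps(asr, cap, video=None, mode="UniD"):
    """Return the (encoder_inputs, target) pairs of one finetuning iteration.

    UniD:   ASR(+video) -> CAP
    BiD:    ASR(+video) -> CAP, then CAP -> ASR (text only)
    BiDalt: ASR+video -> CAP, then CAP+video -> ASR
    """
    has_video = video is not None

    def inputs(text, style, with_video):
        enc = {"text": text, "style": style}
        if with_video:
            enc["video"] = video
        return enc

    steps = [(inputs(asr, "asr", has_video), cap)]
    if mode == "BiD":
        steps.append((inputs(cap, "cap", False), asr))
    elif mode == "BiDalt":
        if not has_video:
            raise ValueError("BiDalt needs video features")
        steps.append((inputs(cap, "cap", True), asr))
    elif mode != "UniD":
        raise ValueError("unknown mode: %s" % mode)
    return steps


asr = "now we fold the dough over itself"
cap = "fold the dough"
frames = [[0.1, 0.2], [0.3, 0.4]]  # toy per-frame features
bid_steps = finetuning_steps(asr, cap, video=frames, mode="BiD")
bidalt_steps = finetuning_steps(asr, cap, video=frames, mode="BiDalt")
```

In the BiD case the reverse step is text-only even for a multimodal model, whereas BiDalt keeps the video encoder active in both directions.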

5 Experiments

| Method | Input | Pretraining | Bleu-4 | Meteor | Rouge-L | CIDEr |
|---|---|---|---|---|---|---|
| Constant Pred (hessel2019case) | - | - | 2.70 | 10.30 | 21.70 | 0.15 |
| MART Lei2020MARTMR | Video | - | 8.00 | 15.90 | - | 0.36 |
| EMT Zhou2018EndtoEndDV | Video | - | 4.38 | 11.55 | 27.44 | 0.38 |
| CBT sun2019contrastive | Video | Kinetics + HowTo100M | 5.12 | 12.97 | 30.44 | 0.64 |
| AT (hessel2019case) | ASR | - | 8.55 | 16.93 | 35.54 | 1.06 |
| AT+Video (hessel2019case) | Video + ASR | - | 9.01 | 17.77 | 36.65 | 1.12 |
| UniViLM #1 (luo2020univilm) | Video | - | 6.06 | 12.47 | 31.48 | 0.64 |
| UniViLM #2 (luo2020univilm) | Video + ASR | - | 8.67 | 15.38 | 35.02 | 1.00 |
| UniViLM #5 (luo2020univilm) | Video + ASR | HowTo100M | 10.42 | 16.93 | 38.02 | 1.20 |
| Ø Pretraining | | | | | | |
| E2D6-BiD | ASR | - | 7.90 | 15.70 | 34.86 | 0.93 |
| E2vidD6-BiD | Video + ASR | - | 8.01 | 16.19 | 34.66 | 0.91 |
| Text Pretraining | | | | | | |
| E2D6-MASS-BiD | ASR | YT8M-cook + Recipe1M | 10.60 | 17.42 | 38.08 | 1.20 |
| E2vidD6-MASS-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.47 | 17.70 | 38.80 | 1.25 |
| Multimodal Pretraining | | | | | | |
| E2vidD6-MASSalign-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.53 | 17.62 | 39.03 | 1.22 |
| E2vidD6-MASSvid-BiD | Video + ASR | YT8M-cook + Recipe1M | 12.04 | 18.32 | 39.03 | 1.23 |
| E2vidD6-MASSdrop-BiD | Video + ASR | YT8M-cook + Recipe1M | 10.45 | 17.74 | 38.82 | 1.22 |
| Human (hessel2019case) | Video + ASR | - | 15.20 | 25.90 | 45.10 | 3.80 |

Table 3: Segment-level captioning results on YouCook2. We use YT8M-cook and Recipe1M for pretraining. The numbers for the related work (first group) are directly reported from the corresponding papers. The last line is an estimate of human performance as reported by hessel2019case, and can be taken as a rough upper bound of the best performance achievable.

| Method | Input | ViTT-All Bleu-1 | ViTT-All Meteor | ViTT-All Rouge-L | ViTT-All CIDEr | ViTT-Cooking Bleu-1 | ViTT-Cooking Meteor | ViTT-Cooking Rouge-L | ViTT-Cooking CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| Constant baseline (“intro”) | - | 1.42 | 3.32 | 11.15 | 0.28 | 1.16 | 2.93 | 10.21 | 0.25 |
| Ø Pretraining | | | | | | | | | |
| E2D6-BiD | ASR | 19.60 | 9.12 | 27.88 | 0.68 | 20.77 | 10.08 | 28.63 | 0.72 |
| E2vidD6-BiD | Video + ASR | 19.49 | 9.23 | 28.53 | 0.69 | 20.45 | 9.88 | 28.88 | 0.69 |
| Text Pretraining | | | | | | | | | |
| E2D6-MASS-BiD | ASR | 21.93 | 10.60 | 30.45 | 0.79 | 24.79 | 12.25 | 32.40 | 0.88 |
| E2vidD6-MASS-BiD | Video + ASR | 22.44 | 10.83 | 31.27 | 0.81 | 24.22 | 12.22 | 32.60 | 0.89 |
| Multimodal Pretraining | | | | | | | | | |
| E2vidD6-MASSalign-BiD | Video + ASR | 22.31 | 10.66 | 31.13 | 0.79 | 24.92 | 12.25 | 33.09 | 0.90 |
| E2vidD6-MASSvid-BiD | Video + ASR | 22.45 | 10.76 | 31.49 | 0.80 | 24.87 | 12.43 | 32.97 | 0.90 |
| E2vidD6-MASSdrop-BiD | Video + ASR | 22.37 | 11.00 | 31.40 | 0.82 | 24.48 | 12.22 | 33.10 | 0.89 |
| Human | Video + ASR | 43.34 | 33.56 | 41.88 | 1.26 | 41.61 | 32.50 | 41.59 | 1.21 |

Table 4: Segment-level captioning results on ViTT. For ViTT-All we pretrain on HowTo100M and WikiHow; for ViTT-Cooking we pretrain on YT8M-cook and Recipe1M. We report baseline scores for predicting the most common caption “intro”. We also estimate the human performance as a rough upper bound (details in Supplementary Material A.1; Table 9).

5.1 Implementation Details

We tokenize ASR and CAP inputs using byte-pair–encoding subwords (sennrich2015neural), and truncate them to 240 subwords. We truncate video sequences to 40 frames (40 seconds of video), compute the 128-dim features proposed by wang2014learning (which we will refer to as Compact 2D features), and project them to the embedding space using a two-layer perceptron with layer normalization and GeLU activations.

We instantiate the E2xDx models defined in Section 4.1 with 128-dimensional embeddings and 8 heads respectively for self-attention, encoder-decoder, and cross-modal attention modules. We define each epoch to be 3,125 iterations, where each iteration contains one repetition of each training step as represented in Table 2. We pretrain for 200 epochs and finetune for 30 epochs.

For evaluation, we consider Bleu-4 (papineni2002bleu), Meteor (denkowski2014meteor), Rouge-L (lin2004automatic) and CIDEr (vedantam2015cider) metrics.

Please refer to Appendix A.3 for full implementation details, hyperparameters and computation cost.

Notes on ViTT evaluation:

With the exception of Rouge-L, all other metrics are sensitive to short groundtruth. 67% of the groundtruth tags in ViTT have fewer than 4 words, in which case even a perfect prediction will not yield a full score in, say, Bleu-4. Thus, we focus mainly on Rouge-L, report Bleu-1 instead of Bleu-4 for ViTT, and provide the other two metrics only as reference points.
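The effect of short references on Bleu-4 can be checked with a bare-bones, unsmoothed sentence-level BLEU. This is an illustrative sketch, not the evaluation code used for the reported numbers; standard toolkits typically apply smoothing or corpus-level aggregation, so their scores on short tags may be small rather than exactly zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Unsmoothed sentence-level BLEU with a single reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no n-grams of this order, or none matched
        log_precisions.append(math.log(overlap / total))
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


short_tag = "shaping the cookies".split()
long_tag = "removing any leftover glass".split()
bleu1_short = bleu(short_tag, short_tag, max_n=1)
bleu4_short = bleu(short_tag, short_tag, max_n=4)
bleu4_long = bleu(long_tag, long_tag, max_n=4)
```

A three-word tag contains no 4-grams, so a perfect prediction scores 1.0 under Bleu-1 but 0.0 under this unsmoothed Bleu-4, while a four-word tag reaches 1.0 under both.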

We had originally decided to use videos with multiple annotations as validation and test data, so that we could explore evaluation with multiple reference groundtruth captions. But as annotators do not always yield the same set of segment boundaries, this became tricky. Instead, we simply treat each segment as a separate instance with one single reference caption. Note that all segments annotated for the same video will be in either validation or test to ensure no content overlap.

5.2 Main Results

We run several variants of our method on YouCook2, ViTT-All and ViTT-Cooking, using different architectures, modalities, pretraining datasets, pretraining and finetuning strategies.

Comparing with other methods on YouCook2

For YouCook2, we report our method alongside several methods from the literature (hessel2019case; sun2019videobert; Zhou2018EndtoEndDV; Lei2020MARTMR), as well as state-of-the-art concurrent work (luo2020univilm). The related work is provided for reference and to give a ballpark estimate of the relative performance of each method, but results are not always strictly and directly comparable. Beyond the usual sources of discrepancy in data processing, tokenization, or even different splits, an additional source of complication comes from the fact that videos are regularly deleted by content creators, causing video datasets to shrink over time. Additionally, when comparing to other work incorporating pretraining, we could differ in (videos available in) pretraining datasets, segmentation strategies, etc. To this end, we perform an extensive ablation study, which at least helps us to understand the effectiveness of different components in our approach.

Effect of pretraining

The main experimental results for the three datasets we consider are summarized in Table 3 (YouCook2) and Table 4 (ViTT-All and ViTT-Cooking). Across all three datasets, the best performance is achieved by finetuning a multimodal captioning model under the Multimodal Pretraining condition. For instance, on YouCook2, E2vidD6-MASSvid-BiD improves over the no-pretraining model E2vidD6-BiD by 4.37 Rouge-L, a larger improvement than UniViLM with pretraining (#5) vs without (#2) luo2020univilm. This improvement also holds in ViTT-Cooking (+4.22 in Rouge-L) and ViTT-All (+2.97 in Rouge-L). We do not observe consistent and significant trends among the different multimodal pretraining strategies: MASS pretraining with video (MASSvid), with video and droptext (MASSdrop), or with alignment tasks (MASSalign). (Limited improvement with MASSalign suggests that such alignment tasks are better suited for retrieval (luo2020univilm).) Furthermore, we observe that most of the pretraining improvement is achievable via text-only MASS pretraining. Across all three datasets, while Multimodal Pretraining (E2vidD6-MASSvid-BiD) is consistently better than Text Pretraining (E2vidD6-MASS-BiD), the differences are quite small (under one Rouge-L point).
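The Rouge-L gains quoted above can be recomputed from the rounded entries of Tables 3 and 4. The snippet below is only a bookkeeping aid; the dictionary layout and the helper name `best_multimodal_gain` are hypothetical.

```python
# Rouge-L scores copied from Table 3 (YouCook2) and Table 4 (ViTT).
ROUGE_L = {
    "YouCook2": {
        "no_pretraining": 34.66,  # E2vidD6-BiD
        "multimodal": {"MASSalign": 39.03, "MASSvid": 39.03, "MASSdrop": 38.82},
    },
    "ViTT-All": {
        "no_pretraining": 28.53,
        "multimodal": {"MASSalign": 31.13, "MASSvid": 31.49, "MASSdrop": 31.40},
    },
    "ViTT-Cooking": {
        "no_pretraining": 28.88,
        "multimodal": {"MASSalign": 33.09, "MASSvid": 32.97, "MASSdrop": 33.10},
    },
}


def best_multimodal_gain(dataset):
    """Best multimodal-pretraining Rouge-L minus the no-pretraining score."""
    scores = ROUGE_L[dataset]
    return round(max(scores["multimodal"].values()) - scores["no_pretraining"], 2)


gains = {name: best_multimodal_gain(name) for name in ROUGE_L}
```

From the rounded table entries this gives 4.37 for YouCook2, 4.22 for ViTT-Cooking and 2.96 for ViTT-All; the last value is 0.01 below the +2.97 quoted in the text, presumably because the text was computed from unrounded scores.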

It is worth noting that for MASSalign, the best validation accuracies for the pretraining tasks are reasonably high: for YT8M-cook, we observed 90% accuracy for the alignment task, and 80% for the ordering task (for HowTo100M: 87% and 71.4%, respectively), where random guessing would yield 50%. This suggests that our video features are reasonably strong, and the findings above are not due to weak visual representations.

Method | Bleu-4 | Meteor | Rouge-L | CIDEr
D2-UniD | 10.84 | 17.39 | 38.24 | 1.16
D6-UniD | 11.39 | 18.00 | 38.71 | 1.22
D2-BiD | 11.38 | 18.04 | 38.67 | 1.19
D6-BiD | 11.47 | 17.70 | 38.80 | 1.25
D6-BiDalt | 11.07 | 17.68 | 38.43 | 1.22
D6-BiD (S3D) | 11.64 | 18.04 | 38.75 | 1.24

Table 5: Ablation study on YouCook2. We finetune a multimodal captioning model (E2vid) with either 2-layer decoder (D2) or 6-layer decoder (D6) using YT8M-cook/Recipe1M for MASS pretraining, combined with either unidirectional (UniD) or bidirectional (BiD) finetuning. We find no significant difference between using 2D and 3D features (marked as S3D).

Effect of other modeling choices

We experiment with 2-layer decoder (D2) vs 6-layer decoder (D6), combined with either unidirectional finetuning (UniD) or bidirectional finetuning (BiD). Table 5 shows ablation results of the four possible combinations when finetuning a multimodal model using text-only pretraining on YouCook2 (a more complete list of results can be found in Appendix A.5, showing similar trends). The D6×BiD combination tends to yield the best performance, with the differences among the four configurations being relatively small (under one Rouge-L point). For visual features, we also explored using 3D features (xie2018rethinking) instead of 2D features during finetuning (with no pretraining or text-only pretraining), and did not find much difference in model performance on YouCook2. As a result, we use the simpler 2D features in our multimodal pretraining. We leave more extensive experiments with visual features as future work.

Generalization implications

An important motivation for constructing the ViTT dataset and evaluating our models on it has been related to generalization. Since the YouCook2 benchmark is restricted to a small number of cooking recipes, there is little to be understood about how well models trained and evaluated on it generalize. In contrast, the ViTT benchmark has a much wider coverage (for both cooking-related videos and general instructional videos), and no imposed topic overlap between train/dev/test. As such, there are two findings here that are relevant with respect to generalization: (a) the absolute performance of the models on the ViTT benchmark is quite high (ROUGE-L scores above 0.30 are usually indicative of decent performance), and (b) the performance on ViTT vs. YouCook2 is clearly lower (31.5 ROUGE-L vs. 39.0 ROUGE-L, reflecting the increased difficulty of the new benchmark), but it is maximized under similar pretraining and finetuning conditions, which allows us to claim that the resulting models generalize well and are quite robust over a wide variety of instructional videos.

6 Conclusions

Motivated to improve information-seeking capabilities for videos, we have collected and annotated a new dense video captioning dataset, ViTT, which is larger and more diverse than YouCook2. We investigated several multimodal pretraining strategies for segment-level video captioning, and conducted extensive ablation studies. We concluded that MASS-style pretraining is the most decisive factor in improving the performance on all the benchmarks used. Even more to the point, our results indicate that most of the performance can be attributed to leveraging the ASR signal. We achieve new state-of-the-art results on the YouCook2 benchmark, and establish strong performance baselines for the new ViTT benchmark, which can be used as starting points for driving more progress in this direction.


We send warm thanks to Ashish Thapliyal for helping the first author debug his code and navigate the computing infrastructure, and to Sebastian Goodman for his technical help (and lightning fast responses!). We also thank the anonymous reviewers for their comments and suggestions.


Appendix A

Supplementary Material for “Multimodal Pretraining for Dense Video Captioning”.

A.1 The ViTT dataset

Sampling video for annotation.

The goal of the ViTT dataset design is to mirror topic distribution in the “wild”. Therefore, instead of starting from specific how-to instructions and searching for corresponding videos, we sampled videos from the validation set of the YouTube-8M dataset (abu2016youtube), a large-scale collection of YouTube videos with topical labels, subject to YouTube policies.

Exclusion criteria were lack of English ASR and the topic label “Game”. The latter was motivated by the fact that in this type of video, the visual information predominantly features video games, while the ViTT dataset was intended to contain only videos with real-world human actions. Cooking videos can be easily identified by sampling videos that came with “Cooking” or “Recipe” topic labels. Given this convenience and the fact that much of the prior work in this area had focused on cooking videos, approximately half of the dataset was designed to include cooking videos only, while the remaining videos would be randomly sampled non-cooking videos, as long as they were verified as instructional by human annotators.

Annotation process

Annotators were presented with a video alongside its timestamped, automatic transcription shown in sentence-length paragraphs. They were asked to watch the video and first judge whether the video was instructional. For the purpose of our dataset, we determine that a video is instructional if it focuses on real-world human actions that are accompanied by procedural language explaining what is happening on screen, in reasonable detail. Also for our purposes, instructional videos need to be grounded in real life, with a real person in the video exemplifying the action being verbally described.

For videos judged to be instructional, annotators were then asked to:

  • Delimit the main segments of the video.

  • Determine their start time if different from the automatically suggested start time (explained below).

  • Provide a label summarizing or explaining the segment.

Annotation guidelines

Annotators were instructed to identify video segments with two potential purposes:

  • Allow viewers to jump straight to the start of a segment for rewatch.

  • Present viewers with an index to decide whether to watch the video in full or directly skip to the segment of interest.

Our guidelines suggested a range of five to ten segments as long as the structure and content of the video permitted. For short videos, the direction was to prioritize quality over quantity and to only define those segments that formed the narrative structure of the video, even if the resulting number of segments was below five.

To help annotators determine segment start times, transcriptions were shown in “sentences” — we expected that sentence start times might be good candidates for segment start times. We obtained sentence boundaries automatically as follows. Given the stream of timestamped ASR tokens for a video, we first separated them into blocks by breaking two consecutive tokens whenever they were more than 2 seconds apart. We then used a punctuation prediction model to identify sentence boundaries in each resulting block. Each sentence was shown with the timestamp corresponding to its first token. Annotators were advised that transcriptions had been automatically divided into paragraphs that may or may not correspond to a video segment — if they decided that a segment started from a particular sentence, they could choose to use the start time of the sentence as the start time for the segment, or, if needed, they could put in an adjusted start time instead.
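The first step of the sentence-boundary procedure described above (breaking the ASR token stream at long pauses) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each token carries a single timestamp in seconds, and the names `Token`, `split_into_blocks`, and `block_start_times` are hypothetical. The punctuation prediction model that further splits each block into sentences is not reproduced here.

```python
from typing import List, Tuple

# Assumed representation: (word, timestamp in seconds).
Token = Tuple[str, float]


def split_into_blocks(tokens: List[Token], max_gap: float = 2.0) -> List[List[Token]]:
    """Break the token stream wherever two consecutive tokens
    are more than `max_gap` seconds apart."""
    blocks: List[List[Token]] = []
    for token in tokens:
        # Continue the current block only if the pause is within the threshold.
        if blocks and token[1] - blocks[-1][-1][1] <= max_gap:
            blocks[-1].append(token)
        else:
            blocks.append([token])
    return blocks


def block_start_times(blocks: List[List[Token]]) -> List[float]:
    """Timestamp of the first token in each block."""
    return [block[0][1] for block in blocks]


example_tokens = [
    ("hello", 0.0), ("everyone", 0.5),          # 2.5 s pause follows
    ("today", 3.0), ("we", 3.5), ("bake", 5.5),  # gaps of 0.5 s and 2.0 s
]
example_blocks = split_into_blocks(example_tokens)
```

In this toy input, the 2.5-second pause starts a new block, while the gap of exactly 2 seconds does not, since the text specifies "more than 2 seconds".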

Once the start time had been identified, annotators were asked to provide a free-text label to summarize each segment. We instructed the annotators to use nouns or present participles (-ing form of verbs) to write the labels for the video segments, whenever possible. Additionally, we asked that the labels be succinct while descriptive, using as few words as possible to convey as much information as possible.

Data statistics and post-processing

The resulting dataset consists of 8,169 instructional videos that received segment-level annotations, of which 3,381 are cooking-related. Overall there are an average of 7.1 segments per video (max: 19). Given our instructions, the descriptions are much shorter than in a typical captioning dataset: on average there are 2.97 words per description (max: 16); 20% of the captions are single-word, 22% are two words, and 25% are three words. We refer to these descriptions as “tags” given how short they are.

When possible, annotators were also asked to start and end the video with an opening and closing segment. As a result, most annotations start with an introduction segment: this accounts for roughly 11% of the 88,455 segments in the dataset (“intro”: 8%, “introduction”: 2.3%). Note that while “intro” and “introduction” are clearly paraphrases of each other, an automatic metric will penalize a model predicting “intro” when the groundtruth is “introduction”. Similarly, the ending segment was described in several varieties: “outro”: 3.4%, “closing”: 1%, “closure”, “conclusion”, “ending”, “end of video”: each under 1%. Penalizing paraphrases of the ground truth is an inherent weakness of automatic metrics. To mitigate this, we decided to reduce the chance of this happening for the most frequent tags in the dataset. That is, in our experiments, we identified three groups of tags among the top-20 most frequent tags, and standardized them as follows.

Standardized tag | Original tags
intro | intro, introduction, opening
outro | outro, closing, closure, conclusion, ending, end of video, video closing
result | finished result, final result, results
Table 6: Standardization of top tags
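The standardization in Table 6 amounts to a small lookup table. The sketch below is one possible way to apply it; the names `TAG_GROUPS`, `CANONICAL`, and `standardize_tag` are hypothetical, and the lowercasing and whitespace normalization are assumptions added for illustration rather than steps stated in the text.

```python
# Groups copied from Table 6: canonical tag -> variants mapped onto it.
TAG_GROUPS = {
    "intro": ["intro", "introduction", "opening"],
    "outro": ["outro", "closing", "closure", "conclusion",
              "ending", "end of video", "video closing"],
    "result": ["finished result", "final result", "results"],
}

# Invert the groups into a variant -> canonical lookup.
CANONICAL = {
    variant: canonical
    for canonical, variants in TAG_GROUPS.items()
    for variant in variants
}


def standardize_tag(tag: str) -> str:
    """Map a tag to its canonical form; other tags are returned unchanged.

    Assumption: matching ignores case and repeated whitespace.
    """
    key = " ".join(tag.lower().split())
    return CANONICAL.get(key, tag)
```

For example, `standardize_tag("Introduction")` returns `"intro"`, while a tag outside the three groups, such as `"mixing ingredients"`, is left as is.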

Note that this does not mean we can solve this problem as a classification task like in visual question answering (VQA): overall, there are 56,027 unique tags with a vocabulary size of 12,509 for the 88,455 segments; 51,474 tags appeared only once in the dataset, making it infeasible to reduce the segment-level captioning problem into a pure classification task. Table 7 shows the top 10 most frequent tags after standardization.

Tag % of segments
intro 11.4
outro 6.6
result 0.9
ingredients 0.8
listing ingredients 0.2
supplies 0.2
mixing ingredients 0.2
materials 0.1
what you’ll need 0.1
lining the eyes 0.1
Table 7: 10 most frequent tags after standardization.

Estimate of human performance.

A subset of the candidate videos were given to three annotators [Footnote 9: A small set were unintentionally given to six annotators.] to help us understand variations in human annotations. 5,840 videos received dense captioning from exactly one annotator and were used as training data. Videos with more than one annotation were used as validation / test data. Note that not all the videos with multiple timeline annotations have exactly three sets of them; in fact, 1,368 videos received 3-way segment-level annotations. This is because not all annotators agreed on whether a video was instructional. Computing annotator agreement for the annotated timelines is non-trivial. Here we focus on an estimate of tagging agreement when a pair of annotators agreed over the segment start time. Specifically, we go through each video that received multiple segment-level annotations. For each segment where two annotators chose the same ASR sentence as its starting point, we take the tags they produced for this segment and consider one of them as groundtruth, the other as prediction, and add that into our pool of (groundtruth, prediction) pairs. We can then compute standard automatic evaluation metrics over this pool. The results are as follows.

BLEU-1 | METEOR | ROUGE-L | CIDEr
43.34 | 33.56 | 41.88 | 1.26
Table 8: Estimate of human performance for the segment-level captioning on ViTT-All (computed over 7,528 pairs).

BLEU-1 | METEOR | ROUGE-L | CIDEr
41.61 | 32.50 | 41.59 | 1.21
Table 9: Estimate of human performance for the segment-level captioning on ViTT-Cooking (computed over 2,511 pairs).
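The pooling of (groundtruth, prediction) pairs used for these estimates can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each annotator's timeline is assumed to be a mapping from the index of the ASR sentence chosen as a segment start to the tag for that segment, and the choice of which annotator counts as groundtruth (here, the first in sorted order) is arbitrary. The function name `agreement_pairs` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed format for one video: annotator id -> {ASR sentence index: tag}.
Timelines = Dict[str, Dict[int, str]]


def agreement_pairs(timelines: Timelines) -> List[Tuple[str, str]]:
    """Collect (groundtruth, prediction) tag pairs for every segment where
    two annotators picked the same ASR sentence as the start point."""
    pairs: List[Tuple[str, str]] = []
    for first, second in combinations(sorted(timelines), 2):
        shared_starts = timelines[first].keys() & timelines[second].keys()
        for sentence_idx in sorted(shared_starts):
            pairs.append((timelines[first][sentence_idx],
                          timelines[second][sentence_idx]))
    return pairs


example_video = {
    "a1": {0: "intro", 4: "mixing batter"},
    "a2": {0: "introduction", 7: "baking"},
    "a3": {4: "mixing ingredients", 7: "baking cake"},
}
example_pairs = agreement_pairs(example_video)
```

Pairs from all multiply-annotated videos would then be concatenated into one pool and scored with the usual captioning metrics.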

Note that METEOR and CIDEr scores are both penalized by the lack of n-grams for higher n. That is, when both groundtruth and prediction are single-word, say, “intro”, this pair will not receive a full score from any of these metrics. But the Rouge-L score is in the same ballpark as the estimate of human performance in prior work (hessel2019case). One might note that perhaps this pool of label pairs contains a higher share of “intro”, since annotators might be more likely to agree over where an opening segment starts. Indeed, 20% of the time, one of the tags is “intro”. Interestingly, in spite of standardization of top tags, 14% of the time one tag is “intro” and the other tag is not: they can be less frequent paraphrases (e.g., “welcoming”, “greeting”, “opening and welcoming”) or something semantically different (e.g., “using dremel tool”).
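To make the paraphrase penalty concrete, the sketch below computes a simplified word-level ROUGE-L F1 score based on the longest common subsequence (LCS). It is an illustration only: the exact scorer, tokenization, and recall weighting used for the reported numbers are not specified here, so the equal weighting of precision and recall and the whitespace tokenization are assumptions, and the function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall (assumed beta = 1)."""
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this simplified scorer, an identical single-word pair receives a full score, whereas “intro” against “introduction” shares no token and scores zero, which is the kind of penalty the tag standardization is meant to reduce.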

A.2 Separated vs. Concatenated-Modality Architecture

Prior work has explored both concatenating different modalities and feeding them into the same multimodal Transformer encoder (sun2019videobert; hessel2019case), as well as separating them into unimodal transformers (sun2019contrastive; lu2019vilbert). We opt for the separated architecture because it offers more flexibility. First, the concatenated architecture requires embedding the text and video features into the same space. When the video features are projected using a simple network, there is no guarantee that we can meaningfully project them into the text embedding space. VideoBERT (sun2019videobert) gives more flexibility to the video embeddings by quantizing video features and learning an embedding for each codeword. However, the quantization step has subsequently been claimed to be detrimental (sun2019contrastive). Moreover, the concatenated architecture uses the same sets of feed-forward and attention weights to process text and video, and performs layer normalization jointly between the two modalities, which is not necessarily meaningful. Finally, the separated architecture makes it easy to switch between variable-length text-only, video-only, or text+video modalities, whereas concatenated architectures might rely on separating tokens, modality embeddings, and fixed sequence lengths (luo2020univilm).

A.3 Additional Implementation Details

We optimize all models on an NVIDIA V100 GPU using the Adam optimizer with inverse square root schedule, batch size 32, warm-up period of 4,000 iterations, and maximum learning rate of , following MASS (song2019mass). The positional embeddings are initialized randomly. We use dropout and attention dropout with probabilities . With E2vidD6, pretraining takes 3-6 days depending on the objective and bidirectional finetuning takes up to 1.5 days; however, those times could be improved by optimizing the data pipeline.
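For readers unfamiliar with the inverse square root schedule, the following is a minimal sketch of one common form of it. Two points are assumptions rather than details from the text: the warm-up is taken to be linear, and `max_lr` is a placeholder argument because the maximum learning rate is not given in the passage above. The function name `inverse_sqrt_lr` is hypothetical.

```python
def inverse_sqrt_lr(step: int, max_lr: float, warmup_steps: int = 4000) -> float:
    """Learning rate at a given (1-indexed) training step.

    Assumed shape: linear warm-up to `max_lr` over `warmup_steps` iterations,
    then decay proportional to the inverse square root of the step count.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5
```

With this form, the rate peaks at `max_lr` at the end of warm-up and halves each time the step count quadruples, for instance at step 16,000 when the warm-up is 4,000 iterations.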

A.4 Example Predictions

We show examples of good and bad predictions on YouCook2 (Figure 3) and ViTT-All (Figures 4 and 5). The captions are generated by E2vidD6-BiD (no pretraining) and E2vidD6-MASS-BiD (text-only MASS pretraining).

Ground Truth | Ø-Pretraining | MASS-Pretraining | Comments
crush and grate the garlic | grate garlic and add to bowl (good) | crush ginger and garlic (good) | ginger is correct despite not appearing in ground truth
crimp shut with fork | place the filling on the wrapper (ok) | seal the edges of the wrapper (good) | pretrained model is more specific
place wings on the baking sheet and cook flipping | bake the pizza in the oven (bad) | cook the wings on the grill (good) | only pretrained model predicted correct food
add the pork back into the hot oil | add the rice to the pot (bad) | place the meat on the pan (good) | Ø model hallucinates the rice and pot
add thyme bay leaves onion and clam juice and boil the mixture | add diced tomatoes tomato puree and mix well (bad) | add thyme thyme onion and clam juice to the pot and stir (ok) | Ø hallucinates a lot of nonexistent ingredients
cook bacon in a pot with oil and pepper | add chopped tomatoes to pan and stir (bad) | add bacon and stir (ok) | both models missed oil and pepper (not mentioned in ASR)
pour dressing on top of the salad and toss | add dressing to the bowl (good) | serve the soup over the salad (bad) | pretrained model referred to dressing as “soup”
slice the ginger into pieces | slice a celery (bad) | slice the chicken (bad) | both models had wrong ingredients (ASR segment does not mention what is being sliced)

Figure 3: Example good and bad predictions on YouCook2. The pretrained model is generally but not always better. Note that there are no “intro” or “outro”-like labels on YouCook2 because the dataset was specifically curated to only contain actual recipe steps.

Ground Truth | Ø-Pretraining | MASS-Pretraining | Comments
tightening extra loop | tightening the loop (good) | tightening the loop (good) | both models perform well
adding eyeshadow | blending eye shadow (good) | applying eye shadow (good) | both models perform well
showcasing the finished look | showing finished look (good) | showing finished look (good) | both models perform well
rolling and folding the clay | rolling and blending (ok) | rolling and folding the clay (good) | MASS is a bit more specific
highlighting brow bone | applying eye shadow (ok) | brushing on the brows (good) | MASS is a bit more specific
covering the chicken and cooking | cooking the bread (bad) | cooking the chicken (good) | only MASS got the right ingredient
connecting spray hose and sprayer | connecting the new cover (ok) | connecting the valve (good) | spray hose is more specific than valve
implementing second layer | showing finished product (ok) | showing second layer (good) | MASS is more specific
making decorative trim | cutting the edges (good) | cutting the fabric (good) | both models yield good predictions
checking bleach container | outro (bad) | checking the container (good) | MASS is a bit more specific
demonstrating the flip | checking the battery (bad) | flipping the board (good) | Ø model got influenced by car mechanics tutorials
tilting board | setting up the oven (bad) | turning the board (good) | Ø overfitted on cooking videos

Figure 4: Example good predictions on ViTT-All (Part 1). The pretrained model is generally but not always better.

Ground Truth | Ø-Pretraining | MASS-Pretraining | Comments
securing the bar in place | removing the cover (bad) | checking for the other side (bad) | predictions are not specific enough
starting with unlocking bars | opening the box (bad) | pulling the car on (bad) | predictions are incorrect or not specific enough
demonstrating technique | attaching paper (bad) | stamping paper (good) | the technique is about stamping the paper
spritzing in additional water | pouring water into the water (ok) | adding water to water (ok) | understandable but ungrammatical
checking for leaks | checking for the new new new new new new new new new new new new new new new (bad) | checking the process (ok) | Ø got into a loop, MASS not specific enough
displaying materials needed | intro (bad) | removing paste (ok) | prediction makes sense because narrator is displaying thermal paste remover
sketching on the swirls | drawing the lines (good) | drawing on the eyes (bad) | pretrained model overfitted on makeup tutorials
crimping wire and completing project | attaching the screws (bad) | attaching the wire to the wire (ok) | both models have trouble with the concept of crimping a wire
cutting with guide line | cutting the top of the top of the top of the top of the top of the top (bad) | explaining process (ok) | Ø model got into a loop, MASS model is not specific enough

Figure 5: Example ok and bad predictions on ViTT (Part 2). The pretrained model is generally but not always better.

A.5 Full result tables

We present here tables with the results of all the ablations that we ran. There are two main takeaway messages from the results involving the pretraining approach: (a) the accuracy improvements, as measured across all the metrics we use, indicate the value of using a pretraining approach to this problem, specifically one that is capable of leveraging the ASR signals at both pretraining and finetuning stages, and (b) the training speedup achieved from pretraining is impressive, as a pretrained model converges much faster than training from scratch. This is especially visible on ViTT-All, where finetuning after MASS pretraining reaches its best Rouge-L score at epoch 2, whereas it takes around 11 epochs to converge when training from scratch.

Method | Input | Pretraining | BLEU-4 | METEOR | ROUGE-L | CIDEr
Constant Pred (hessel2019case) | - | - | 2.70 | 10.30 | 21.70 | 0.15
MART (Lei2020MARTMR) | Video | - | 8.00 | 15.90 | - | 0.36
DPC (shi2019dense) | Video + ASR | - | 2.76 | 18.08 | - | -
EMT (Zhou2018EndtoEndDV) | Video | - | 4.38 | 11.55 | 27.44 | 0.38
CBT (sun2019contrastive) | Video | Kinetics + HowTo100M | 5.12 | 12.97 | 30.44 | 0.64
AT (hessel2019case) | ASR | - | 8.55 | 16.93 | 35.54 | 1.06
AT+Video (hessel2019case) | Video + ASR | - | 9.01 | 17.77 | 36.65 | 1.12
UniViLM #1 (luo2020univilm) | Video | - | 6.06 | 12.47 | 31.48 | 0.64
UniViLM #2 (luo2020univilm) | Video + ASR | - | 8.67 | 15.38 | 35.02 | 1.00
UniViLM #5 (luo2020univilm) | Video + ASR | HowTo100M | 10.42 | 16.93 | 38.02 | 1.20
Ø Pretraining
E2D2-UniD | ASR | - | 7.42 | 15.15 | 33.26 | 0.85
E2D6-UniD | ASR | - | 7.88 | 15.29 | 34.10 | 0.87
E2D2-BiD | ASR | - | 6.85 | 15.64 | 34.26 | 0.91
E2D6-BiD | ASR | - | 7.90 | 15.70 | 34.86 | 0.93
E2vidD2-UniD | Video + ASR | - | 7.47 | 15.11 | 34.77 | 0.90
E2vidD6-UniD | Video + ASR | - | 7.61 | 15.57 | 34.28 | 0.89
E2vidD2-BiD | Video + ASR | - | 8.39 | 15.36 | 34.54 | 0.91
E2vidD6-BiD | Video + ASR | - | 8.01 | 16.19 | 34.66 | 0.91
E2vidD2-BiDalt | Video + ASR | - | 8.12 | 15.83 | 34.83 | 0.93
E2vidD6-BiDalt | Video + ASR | - | 7.70 | 16.11 | 34.78 | 0.91
E2vidD2-BiD (S3D) | Video + ASR | - | 8.04 | 16.17 | 36.01 | 0.96
E2vidD6-BiD (S3D) | Video + ASR | - | 7.91 | 16.28 | 35.23 | 0.93
Text Pretraining
E2D2-MASS-UniD | ASR | YT8M-cook + Recipe1M | 10.52 | 17.14 | 37.39 | 1.14
E2D6-MASS-UniD | ASR | YT8M-cook + Recipe1M | 10.72 | 17.74 | 37.85 | 1.17
E2D2-MASS-BiD | ASR | YT8M-cook + Recipe1M | 10.84 | 17.44 | 37.20 | 1.13
E2D6-MASS-BiD | ASR | YT8M-cook + Recipe1M | 10.60 | 17.42 | 38.08 | 1.20
E2vidD2-MASS-UniD | Video + ASR | YT8M-cook + Recipe1M | 10.84 | 17.39 | 38.24 | 1.16
E2vidD6-MASS-UniD | Video + ASR | YT8M-cook + Recipe1M | 11.39 | 18.00 | 38.71 | 1.22
E2vidD2-MASS-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.38 | 18.04 | 38.67 | 1.19
E2vidD6-MASS-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.47 | 17.70 | 38.80 | 1.25
E2vidD2-MASS-BiDalt | Video + ASR | YT8M-cook + Recipe1M | 11.49 | 17.85 | 38.60 | 1.18
E2vidD6-MASS-BiDalt | Video + ASR | YT8M-cook + Recipe1M | 11.07 | 17.68 | 38.43 | 1.22
E2vidD2-MASS-BiD (S3D) | Video + ASR | YT8M-cook + Recipe1M | 11.13 | 17.71 | 38.57 | 1.12
E2vidD6-MASS-BiD (S3D) | Video + ASR | YT8M-cook + Recipe1M | 11.64 | 18.04 | 38.75 | 1.24
Multimodal Pretraining
E2vidD2-MASSalign-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.54 | 17.57 | 37.70 | 1.15
E2vidD6-MASSalign-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.53 | 17.62 | 39.03 | 1.22
E2vidD2-MASSvid-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.17 | 17.71 | 38.32 | 1.17
E2vidD6-MASSvid-BiD | Video + ASR | YT8M-cook + Recipe1M | 12.04 | 18.32 | 39.03 | 1.23
E2vidD2-MASSdrop-BiD | Video + ASR | YT8M-cook + Recipe1M | 11.21 | 17.99 | 38.72 | 1.23
E2vidD6-MASSdrop-BiD | Video + ASR | YT8M-cook + Recipe1M | 10.45 | 17.74 | 38.82 | 1.22
Human (hessel2019case) | Video + ASR | - | 15.20 | 25.90 | 45.10 | 3.80

Table 10: Video Captioning Results on YouCook2. We use YT8M-cook/Recipe1M for pretraining. All video features are Compact 2D (wang2014learning) except when marked as S3D (xie2018rethinking).

Method | Input | Pretraining | BLEU-1 | METEOR | ROUGE-L | CIDEr
Constant baseline (“intro”) | - | - | 1.42 | 3.32 | 11.15 | 0.28
Ø Pretraining
E2D2-UniD | ASR | - | 17.94 | 8.55 | 27.06 | 0.64
E2D6-UniD | ASR | - | 18.91 | 8.96 | 27.80 | 0.67
E2D2-BiD | ASR | - | 18.81 | 8.82 | 27.63 | 0.65
E2D6-BiD | ASR | - | 19.60 | 9.12 | 27.88 | 0.68
E2vidD2-UniD | Video + ASR | - | 18.94 | 8.99 | 28.05 | 0.67
E2vidD6-UniD | Video + ASR | - | 19.29 | 9.15 | 27.97 | 0.69
E2vidD2-BiD | Video + ASR | - | 19.37 | 9.21 | 28.56 | 0.69
E2vidD6-BiD | Video + ASR | - | 19.49 | 9.23 | 28.53 | 0.69
Text Pretraining
E2D2-MASS-UniD | ASR | HowTo100M + WikiHow | 21.53 | 10.24 | 29.95 | 0.77
E2D6-MASS-UniD | ASR | HowTo100M + WikiHow | 22.09 | 10.58 | 30.67 | 0.79
E2D2-MASS-BiD | ASR | HowTo100M + WikiHow | 20.73 | 10.20 | 30.15 | 0.76
E2D6-MASS-BiD | ASR | HowTo100M + WikiHow | 21.93 | 10.60 | 30.45 | 0.79
E2vidD2-MASS-UniD | Video + ASR | HowTo100M + WikiHow | 21.46 | 10.45 | 30.56 | 0.78
E2vidD6-MASS-UniD | Video + ASR | HowTo100M + WikiHow | 22.21 | 10.75 | 30.86 | 0.81
E2vidD2-MASS-BiD | Video + ASR | HowTo100M + WikiHow | 21.78 | 10.64 | 30.72 | 0.79
E2vidD6-MASS-BiD | Video + ASR | HowTo100M + WikiHow | 22.44 | 10.83 | 31.27 | 0.81
Multimodal Pretraining
E2vidD2-MASSalign-BiD | Video + ASR | HowTo100M + WikiHow | 22.07 | 10.33 | 30.60 | 0.77
E2vidD6-MASSalign-BiD | Video + ASR | HowTo100M + WikiHow | 22.31 | 10.66 | 31.13 | 0.79
E2vidD2-MASSvid-BiD | Video + ASR | HowTo100M + WikiHow | 22.15 | 10.75 | 31.06 | 0.80
E2vidD6-MASSvid-BiD | Video + ASR | HowTo100M + WikiHow | 22.45 | 10.76 | 31.49 | 0.80
E2vidD2-MASSdrop-BiD | Video + ASR | HowTo100M + WikiHow | 21.84 | 10.55 | 31.10 | 0.79
E2vidD6-MASSdrop-BiD | Video + ASR | HowTo100M + WikiHow | 22.37 | 11.00 | 31.40 | 0.82
Human estimate | Video + ASR | - | 43.34 | 33.56 | 41.88 | 1.26

Table 11: Video captioning results on ViTT-All. We use HowTo100M/WikiHow for pretraining. We also estimate human performance (details in Appendix A.1; Table 8).

Method | Input | Pretraining | BLEU-1 | METEOR | ROUGE-L | CIDEr
Constant baseline (“intro”) | - | - | 1.16 | 2.93 | 10.21 | 0.25
Ø Pretraining
E2D2-UniD | ASR | - | 19.73 | 9.43 | 27.95 | 0.69
E2D6-UniD | ASR | - | 20.24 | 9.93 | 28.59 | 0.71
E2D2-BiD | ASR | - | 19.73 | 9.72 | 27.92 | 0.68
E2D6-BiD | ASR | - | 20.77 | 10.08 | 28.63 | 0.72
E2vidD2-UniD | Video + ASR | - | 19.97 | 9.75 | 28.30 | 0.69
E2vidD6-UniD | Video + ASR | - | 20.46 | 9.93 | 28.62 | 0.69
E2vidD2-BiD | Video + ASR | - | 20.60 | 10.08 | 29.45 | 0.71
E2vidD6-BiD | Video + ASR | - | 20.45 | 9.88 | 28.88 | 0.69
Text Pretraining
E2D2-MASS-UniD | ASR | YT8M-cook + Recipe1M | 22.89 | 11.53 | 31.62 | 0.84
E2D6-MASS-UniD | ASR | YT8M-cook + Recipe1M | 24.47 | 12.22 | 32.51 | 0.90
E2D2-MASS-BiD | ASR | YT8M-cook + Recipe1M | 22.75 | 11.63 | 31.54 | 0.84
E2D6-MASS-BiD | ASR | YT8M-cook + Recipe1M | 24.79 | 12.25 | 32.40 | 0.88
E2vidD2-MASS-UniD | Video + ASR | YT8M-cook + Recipe1M | 23.86 | 11.85 | 32.32 | 0.86
E2vidD6-MASS-UniD | Video + ASR | YT8M-cook + Recipe1M | 24.32 | 12.32 | 32.90 | 0.90
E2vidD2-MASS-BiD | Video + ASR | YT8M-cook + Recipe1M | 22.93 | 11.68 | 32.15 | 0.87
E2vidD6-MASS-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.22 | 12.22 | 32.60 | 0.89
Multimodal Pretraining
E2vidD2-MASSalign-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.02 | 11.91 | 32.73 | 0.86
E2vidD6-MASSalign-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.92 | 12.25 | 33.09 | 0.90
E2vidD2-MASSvid-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.15 | 12.10 | 32.96 | 0.88
E2vidD6-MASSvid-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.87 | 12.43 | 32.97 | 0.90
E2vidD2-MASSdrop-BiD | Video + ASR | YT8M-cook + Recipe1M | 23.70 | 12.01 | 32.71 | 0.88
E2vidD6-MASSdrop-BiD | Video + ASR | YT8M-cook + Recipe1M | 24.48 | 12.22 | 33.10 | 0.89
Human estimate | Video + ASR | - | 41.61 | 32.50 | 41.59 | 1.21

Table 12: Video captioning results on ViTT-Cooking. We use YT8M-cook and Recipe1M for optional pretraining.