Video Understanding as Machine Translation

06/12/2020 ∙ by Bruno Korbar, et al. ∙ Facebook

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).







1 Introduction

Labeling and curating video understanding datasets is a laborious and expensive process. Most recent video datasets are collected according to one of two possible paradigms: 1) download large amounts of Web videos, and manually label them according to a set of predefined action classes Soomro et al. (2012); Kuehne et al. (2011); Kay et al. (2017); Gu et al. (2018); or 2) draft a set of textual descriptions of actions of interest, and use human actors to act them out Schuldt et al. (2004); Sigurdsson et al. (2016); Goyal et al. (2017). Besides being time-consuming and costly, these approaches are limited to a closed-world understanding of videos. For example, a model trained on an ontology of professional sports classes is unlikely to be useful to recognize everyday kitchen activities. This raises a crucial research question: how can we learn generic video representations that are useful for multiple downstream tasks, without having to label or enact videos with every possible observable action?

Recent work in self-supervised video representation learning attempts to provide a solution. Such approaches typically define pretext tasks that transform the video in a certain way, and train the model to predict that transformation Misra et al. (2016); Wei et al. (2018); Benaim et al. (2020). While these transformations could be defined purely in the video pixel space, several approaches exploit the multimodal nature of videos and propose learning transformations from one modality to the other Arandjelovic and Zisserman (2017); Owens and Efros (2018); Korbar et al. (2018); Ng et al. (2018); Miech et al. (2019). These modalities include optical flow/motion, audio, and more recently, speech transcribed to text using automatic speech recognition (ASR). Recently introduced datasets such as Cooking312K Sun et al. (2019b) and HowTo100M Miech et al. (2019) leverage YouTube videos with associated ASR text to train joint video-and-text embeddings that have significantly improved performance on multiple downstream video and video + language tasks Miech et al. (2019, 2020); Sun et al. (2019a).

Figure 1:

Overview of the proposed architecture. A multi-modal encoder-decoder transformer is trained on HowTo100M to reconstruct perturbed text, and finetuned on several downstream tasks (i.e., classification, question answering, captioning and retrieval). During finetuning, we combine the trained multi-modal encoder with either the trained decoder (for open-ended text generation and ranking) or a simple linear layer (if candidate outputs are provided—e.g., labels for classification).

The typical approach followed in such joint embedding models is that of metric learning. Specifically, models are trained to map different modalities into a common embedding space such that distances between modalities in this space preserve instance information. The embedding is typically optimized using a ranking Chechik et al. (2010) or contrastive Oord et al. (2018) objective on the pairwise distances between modalities. While effective, such metrics are sensitive to the set of negatives used to optimize the model. Easy negatives (e.g., audio-image pairs where the audio is taken from a video sequence unrelated to that used to extract the image features) make the metric learning too easy and do not yield informative gradients. On the other hand, hard negatives (e.g., using overlapping temporal segments for the two modalities) make the optimization overly challenging and lead to underfitting. Curriculum strategies that render the learning increasingly difficult Korbar et al. (2018); Han et al. (2019) have been shown to be beneficial but require the challenging hand-design of time-evolving sampling policies. The second drawback of contrastive approaches is that they rely on large batches of negatives to learn effective representations Chen et al. (2020). However, the batch size is effectively limited by the GPU memory size. Memory banks Wu et al. (2018) and queue-based models He et al. (2020) stretch the limits of contrastive learning with small mini-batches but introduce approximations caused by stale embeddings during learning.

To address these shortcomings we propose VideoTranslate, a fully generative solution to joint embedding learning based on an encoder-decoder architecture. Unlike ranking or contrastive approaches, our method does not require any negative examples, thus eliminating the need for careful design of sampling strategies and making our training amenable to reasonable batch sizes. We frame the task as a machine translation problem from one modality to another, effectively treating the video modality as a different “language” to be translated into text. Specifically, we experiment with the problem of predicting the narration of instructional videos using the recently introduced large-scale HowTo100M dataset Miech et al. (2019). However, instead of optimizing a model to determine whether a given pair of spoken text and video “match,” we train a language generation model to produce the spoken text associated with the video clip. As one can imagine, learning a model that accurately translates video into spoken text is a very hard task, which would render the optimization simply too difficult. To ease the problem, we propose to provide as additional input to our model a “garbled” version of the spoken text. The noise-corrupted text serves as an “anchor” to successfully tackle the problem of speech generation, which would be overly ambiguous without any language context as input.

Our use of noise perturbation on textual input is inspired by recent work Sun et al. (2019b, a) proposing to train visual-linguistic models by “filling-in” the blanks in accompanying text. However, these prior models predict missing tokens independently and thus cannot directly be used for fully-generative text synthesis (e.g., captioning) unless this is reformulated as filling-in the blanks of a partly masked-out output. Unlike encoder-only architectures such as BERT Devlin et al. (2019), our solution enables the learning of an autoregressive decoder trained to generate text from the multimodal embedding computed from the encoder. This renders our approach fully generative and permits its use at test time without auxiliary text input. Moreover, our encoder-decoder architecture can be transferred and finetuned on several downstream tasks, even novel ones, instead of producing features for a separately-trained model.

Our experiments demonstrate that VideoTranslate can be transferred effectively to a variety of video understanding tasks without the need to alter the model or to append discriminative heads (e.g., classifiers). We merely cast each of these tasks as text prediction and use generative finetuning to further adapt the model to each downstream task. While some of the tasks are naturally defined in the language domain (such as question answering or captioning), we show that our system can obtain excellent performance even on classic discriminative-learning problems such as video retrieval or classification. In fact, our method achieves good accuracy on classification and retrieval even without any form of finetuning on the target dataset, demonstrating strong performance in open-world video understanding. In terms of absolute performance, our model with finetuning achieves the best reported accuracy on the captioning benchmarks of YouCook2 and TVC, as well as on the visual question answering task in TVQA.

2 Related Work

Self-supervised video representations.

Inspired by the recent success of self-supervised representation learning in language Radford et al. (2018); Raffel et al. (2019); Peters et al. (2018); Liu et al. (2019) and images He et al. (2020); Misra and van der Maaten (2020); Chen et al. (2020), there has been a growing interest in applying similar techniques to learn video representations. Typical approaches use a pretext task defined by predicting transformations of the video data, such as temporal ordering Misra et al. (2016); Wei et al. (2018); Xu et al. (2019); Lee et al. (2017); Fernando et al. (2017), color or geometric transformations Vondrick et al. (2018); Jing et al. (2018), multiple views, motion/speed, and optical flow Tian et al. (2019); Wang et al. (2019b); Jayaraman and Grauman (2015); Benaim et al. (2020). Although promising, such approaches have still lagged far behind fully supervised video representations Carreira and Zisserman (2017); Tran et al. (2018). Our work, instead, leverages free supervision in the form of synchronized modalities available with videos, specifically speech transcribed using ASR, which we discuss next.

Multi-modal self-supervision.

Videos are generally accompanied by a number of auxiliary data sources which can serve as potent sources of free supervision. The most informative such source is associated metadata, such as tags or titles from social media, which have shown strong transfer performance Ghadiyaram et al. (2019); Li and Wang (2020). In the case of video, there has been a larger focus on using accompanying modalities such as audio Korbar et al. (2018); Owens and Efros (2018); Arandjelovic and Zisserman (2017, 2018); Alwassel et al. (2019) and speech, transcribed to text using ASR Miech et al. (2019); Sun et al. (2019a, b); Miech et al. (2020). Our work is most closely related to the latter thread, with the key difference being our generative translation-based formulation Raffel et al. (2019) compared to the contrastive metric learning used in most prior work Miech et al. (2020); Sun et al. (2019a).

General-purpose language models.

Prior work in general-purpose encoder-based language models Devlin et al. (2019); Liu et al. (2019), pre-trained with self-supervision on large textual corpora, led to strong improvements after finetuning on various downstream classification Wang et al. (2018, 2019a) and ranking Wu et al. (2019); Karpukhin et al. (2020) tasks. More recently, pretrained encoder-decoder models, such as BART Lewis et al. (2019) and T5 Raffel et al. (2019), achieve further improvements on both discriminative and generative tasks. Our work exploits an encoder-decoder architecture, initialized with T5 weights, in order to learn to generate text from multi-modal representations.

3 Technical Approach

We argue that many video-understanding tasks can be reformulated as text-generation problems. For example, classifying an action in a video is nothing more than naming it with words. We use an encoder-decoder architecture as a general solution that enables training with varied supervision (i.e., captions and free-form labels in HowTo100M) and transfer learning to new tasks (e.g., VQA). Our encoder is multi-modal and can be combined during fine-tuning and inference with either (i) the trained decoder for open-ended text generation (e.g., for captioning) or (ii) a simple linear layer if the output space is constrained (e.g., for classification or multiple-choice VQA). An overview of our model and the overall approach can be seen in Figure 1.


3.1 A Unified Framework for Video-To-Text Translation

Given a task τ, a video v, and a target text t corresponding to the ground-truth answer, we train our model Φ to generate the desired text from the video and a noisy version of the text itself, i.e., t̂ = Φ(τ, v, η(t)). The model is trained to minimize a loss L(t̂, t) between the predicted text t̂ and the ground-truth text t, where the noise function η generates a garbled version of the input text that the model is trained to reconstruct. As further discussed below, we use both masking and shuffling as noise perturbations. We discuss each component of the model next.


Our unified video-to-text model is composed of four main building blocks: a video feature extractor F, a text embedding G, a multimodal encoder E, and a text decoder D. The video feature extractor is implemented as an R(2+1)D-34 network Tran et al. (2018) pre-trained on the IG65M dataset Ghadiyaram et al. (2019). It generates a latent video representation, which we spatially average-pool and pass through a linear layer to transform the embedding dimension to d, such that it matches the embedding dimension of the text features discussed next. We refer to this video representation as F(v). The text embedding is implemented via a lookup table for words of dimensions V × d, where V is the vocabulary size (30K in all our experiments). We encode the task, the noise-perturbed text, and a special [CLS] token using this embedding model, generating embedding tensors of dimensions L_task × d, L_text × d, and 1 × d, respectively. Here L_task and L_text are the maximum number of tokens used to represent the task and text, respectively. Next, the multimodal encoder E is implemented as the encoder part of a T5 model Raffel et al. (2019), based on the Transformer architecture Vaswani et al. (2017) and pre-trained on the “Colossal Clean Crawled Corpus” (C4) Raffel et al. (2019). We concatenate the video features with the text, task and [CLS] embeddings from the embedding model above and pass them through the multimodal encoder. The result is a tensor that incorporates the interaction of text, task and video features using self-attention, and provides d-dimensional features for each input token. Finally, we pass this resulting feature tensor to the text decoder D. The text decoder is implemented using the T5 model decoder, also pretrained on C4. It takes as input the encoded features and outputs a tensor of size L_text × V, which defines probability distributions over the words in the dictionary; these can be used to either sample or evaluate candidate text answers. Note that our model is not limited to operating only on fixed-length strings: while L_text is the maximum number of tokens supported, we pad shorter strings with dummy tokens, and both our text encoder and decoder ignore the padded tokens when encoding or generating the output. The noise function used above is implemented by shuffling the input tokens and masking out n% of them, where n is randomly sampled from a uniform distribution between 0 and 100.
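As a concrete illustration, the noise function can be sketched in a few lines (a toy stand-in operating on a token list rather than subword ids; the [MASK] symbol is a placeholder of our choosing):

```python
import random

def garble(tokens, rng=None):
    """Toy sketch of the noise perturbation: shuffle the input tokens,
    then mask out n% of them, with n drawn uniformly from [0, 100]."""
    rng = rng or random.Random(0)
    tokens = list(tokens)                    # work on a copy
    rng.shuffle(tokens)                      # shuffling perturbation
    pct = rng.uniform(0, 100)                # per-sample masking ratio
    n_mask = round(len(tokens) * pct / 100)
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = "[MASK]"                 # masking perturbation
    return tokens
```

The model receives this garbled sequence as text input and is trained to reconstruct the original order and the masked words.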

We note that our model design is surprisingly straightforward. Indeed, the primary contribution of this work is to show that video can be treated just as another “language” by directly feeding CNN features extracted from it into an unmodified encoder-decoder architecture originally designed to address a variety of NLP tasks. Despite its simplicity, our experiments demonstrate that this unified vision/NLP framework can outperform specialized designs on a variety of downstream video tasks.


We optimize all the parameters of the model to minimize the loss, defined as the mean cross-entropy of the word distributions predicted by the model. Our framework of video-to-text translation is flexible enough to allow us to train the model with multiple tasks concurrently without having to add a specialized network head for each task. Our strategy makes it possible to devote the entire capacity of the encoder-decoder to all tasks, without facing the typical multi-task dilemma of where to split the shared trunk into task-specific heads. Inference for all tasks is done through the same exact encoder-decoder model, with different task inputs triggering different executions through the network for the same video input.
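For concreteness, the loss can be sketched as follows (a toy, list-based stand-in for the batched tensor computation; positions holding padding tokens, which the model ignores, are assumed to be filtered out beforehand):

```python
import math

def mean_cross_entropy(pred_dists, target_ids):
    """Mean cross-entropy between predicted per-position word
    distributions and ground-truth token ids. `pred_dists` is a list of
    probability distributions over the vocabulary, one per output
    position; `target_ids` gives the index of the correct word at each
    position."""
    losses = [-math.log(dist[tok]) for dist, tok in zip(pred_dists, target_ids)]
    return sum(losses) / len(losses)
```

With a two-word vocabulary and two output positions, e.g., mean_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]) ≈ 0.49.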

The primary training of our model is done on the weakly-labelled HowTo100M dataset Miech et al. (2019), which includes 136M clips with accompanying textual narrations automatically extracted from the audio channel using ASR. Each clip also comes with a textual description of the activity performed in the video. The textual description corresponds to one of 23K total categories, labeled automatically using the clip metadata, and is about 8 tokens long on average. We represent narration and description with fixed maximum numbers of tokens, clipping any longer texts to these lengths. We leverage both types of annotations by performing joint multi-task training, forcing the network to predict both the activity (using CLASSIFY as the task to trigger a classification inference) as well as the narration (using CAPTION as the task to trigger a captioning inference) associated with each clip in the mini-batch. For added training efficiency, we re-use each mini-batch of examples twice in a row, performing back-to-back updates first with respect to the narration and then to the description. This is done by simply updating the task and the corresponding target text between the two updates.

Note that our formulation of “video understanding as machine translation” makes it possible to use the model resulting from this training directly on different downstream tasks without any further finetuning, as the text generated by the model is a flexible output representation: e.g., it can be used to perform classification with “unseen” class labels without modification or retraining. As shown in our experiments, our model used in this zero-shot setting already achieves good accuracy on many of our downstream tasks. This is a challenging feat considering the domain gap that separates HowTo100M from the downstream datasets (e.g., EpicKitchens contains only first-person view videos recorded in kitchens, which look substantially different from the varied instructional videos of HowTo100M).


Finetuning on the downstream tasks further elevates the accuracy of our model. We experiment with two finetuning schemes. The first is generative finetuning, which consists of minimizing the cross-entropy loss on the generated text, using as ground truth the labels of the downstream task in textual form. As discussed in the experiments, this also makes it possible to use multiple sources of supervision when different types of annotations are available for the downstream dataset. For example, EpicKitchens includes verb and noun class labels, as well as action descriptions, which we can leverage simultaneously in our unified multi-task generative setting. For classification tasks, we also experiment with appending and training a linear layer as a classification “head” on top of the multimodal encoder embedding computed for the [CLS] token. We refer to this strategy as discriminative finetuning.


In the case of discriminative finetuning, at test time we use the output of the linear layer as the prediction. This is typically done for CLASSIFY tasks. In the case of generative finetuning we consider two alternatives. When no candidate outputs are provided (e.g., for captioning) we decode open-ended text Massarelli et al. (2019). When candidates are provided (e.g., multiple-choice VQA, retrieval) we follow Nogueira et al. (2020) and rank the possible choices in textual form (i.e., candidate answers for multiple-choice VQA, candidate class labels for classification) according to the decoder logits for these candidate outputs.
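A heavily simplified sketch of this candidate-ranking step (for illustration we score each candidate with a fixed table of per-token log-probabilities, whereas the real decoder conditions each token on the tokens generated so far):

```python
def score_candidate(token_logprob, candidate_tokens):
    """Log-probability the (simplified) decoder assigns to a candidate:
    the sum of its per-token log-probabilities."""
    return sum(token_logprob[tok] for tok in candidate_tokens)

def rank_candidates(token_logprob, candidates):
    """Order textual candidates by decoder score, best first."""
    return sorted(candidates,
                  key=lambda c: score_candidate(token_logprob, c),
                  reverse=True)
```

Note that summed log-probabilities penalize longer candidates; real systems often length-normalize the score, a detail omitted here.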

4 Experiments

In this section, we experimentally evaluate our model. We start with an overview of the implementation used in our experiments, followed by a discussion of the downstream tasks and results.

4.1 Implementation details

Model configuration.

In the appendix, we present a comprehensive evaluation reporting the effect of different design choices in our model on downstream performance. Here we summarize the best configuration drawn from that empirical study. The input to our video feature extractor is a sequence of 32 frames uniformly spaced out in the video clip. We apply both masking and shuffling to the input text during training; the amount of masking is randomly chosen from a uniform distribution between 0% and 100% for each mini-batch. We generatively train our model to perform both captioning and classification on HowTo100M, as this results in better performance on all downstream tasks compared to single-task training. The video feature extractor (R(2+1)D) and the T5 encoder-decoder are jointly trained on HowTo100M, starting from their pretrained versions on IG65M and C4, respectively, in order to reduce the cost of training. While we envision that it may be possible to learn both our video network and the T5 model from scratch using HowTo100M data, we reserve this for future investigation. For our encoder-decoder, we use the t5-large model with 770M parameters.

Training hyper-parameters.

Our model is trained on the HowTo100M dataset using Adam with two distinct parameter groups, one for the video feature extractor and one for the multi-modal transformer, each with its own base learning rate. The model is trained for 250K iterations using the inverse square-root schedule, following the procedure from Raffel et al. (2019), with an initial warm-up phase. We set a constant batch size of 8 examples per GPU, and distribute the training over 128 NVIDIA V100 GPUs. For specifics of video and text pre-processing please refer to the appendix.

4.2 Experimental results on downstream tasks

Validation Set Test Set (S1) Test Set (S2)
Method Finetuning Verb Noun Action Verb Noun Action Verb Noun Action
R(2+1)D-34 yes 56.0 34.8 24.9
R(2+1)D-152 Ghadiyaram et al. (2019) yes 57.3 35.7 25.6 65.2 45.1 34.5 58.4 36.9 26.1
GBlend Wang et al. (2019c) yes 59.2 36.1 25.6 66.7 48.5 37.1 58.3 36.7 26.6
BAIDU Wang et al. (2019d) yes 63.2 39.1 29.0 69.8 53.3 41.4 59.7 34.2 25.1
VideoTranslate (ours) no 53.3 34.0 24.6 57.8 35.1 24.9 52.6 34.1 24.8
VideoTranslate (ours) yes: gen 56.3 35.6 25.7 63.8 44.0 34.0 57.2 36.2 26.3
VideoTranslate+gen (ours) yes: gen-MT 58.0 36.5 26.0 65.4 46.0 37.0 58.5 36.9 26.8
VideoTranslate (ours) yes: discr 59.4 37.8 27.8 66.1 48.4 37.2 58.6 37.0 26.9
Table 1: Video classification: comparison to the state-of-the-art on EPIC-Kitchens.

4.2.1 Video Classification

We evaluate the classification performance of our model on EPIC-Kitchens Damen et al. (2018), an egocentric video classification dataset with about 28K training video clips, each labeled with one of 352 nouns and one of 125 verbs. We report accuracy on the separate noun and verb classification tasks, as well as on the combined (noun, verb) classification (known as action classification). We measure performance on the validation set as well as on both splits of the test set (using the online evaluation server), corresponding to seen (S1) and unseen (S2) kitchens. We chose EPIC-Kitchens as the classification test bed for our method since it requires recognizing human-object interactions akin to those found in HowTo100M, unlike other video classification datasets which focus exclusively on humans. At the same time, the ego-centric aspect and the restriction to kitchen settings make this a good benchmark to assess the generalization ability of our model. We present top-1 results obtained with several variants of VideoTranslate in Table 1 (see the appendix for top-5 numbers). It can be noted that VideoTranslate already provides decent accuracy without any form of training on EPIC-Kitchens. Generative finetuning effectively finetunes our model to generate the EPIC-Kitchens labels as captions. This yields a significant boost in accuracy over the results without finetuning: 3.0%, 1.6%, and 1.1% on verb, noun, and action, respectively, on the validation set. We also note that the performance of VideoTranslate with generative finetuning is already superior to that achieved by R(2+1)D-34 (i.e., our video feature extractor) pretrained on IG-65M and discriminatively finetuned on EPIC-Kitchens. Thus, this already shows the added value of our modeling approach.
Since EPIC-Kitchens videos come with longer action descriptions in the form of grammatically correct sentences, we also experiment with a multi-task version of generative finetuning (gen-MT), where we finetune VideoTranslate to generate both captions and class labels (using two different prompts). At test time, we first generate a caption from the video, and then feed the predicted caption as additional text input to VideoTranslate when generating class labels. This yields a substantial additional gain. Finally, we also present accuracies obtained by finetuning VideoTranslate discriminatively. This produces the best results overall for our model: compared to R(2+1)D-34, the gain on the validation set is 3.4%, 3.0%, and 2.9% on verb, noun, and action, respectively. Notably, this variant of our model achieves the best reported numbers for noun and action on the unseen-kitchens test split (S2), which is indicative of the generalization ability of VideoTranslate.

4.2.2 Captioning

We measure the captioning performance of our system on YouCook2 Zhou et al. (2018), TVC Lei et al. (2020), and MSR-VTT Xu et al. (2016). YouCook2 is a cooking video dataset from YouTube with 14K video clips and associated textual descriptions. MSR-VTT contains 200K generic video clips and associated captions. TVC consists of 4198 TV-show videos, with a caption and a subtitle in English for each clip. For TVC, we present results both with and without using subtitles as auxiliary input to the encoder-decoder. For all of these benchmarks we measure performance in terms of BLEU4 score on the standard dataset splits defined by the authors. The captions for YouCook2 and MSR-VTT are generated using a top-k generation strategy. The captions for TVC are generated using the greedy decoding strategy outlined in Lei et al. (2020).
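For readers unfamiliar with top-k generation, one sampling step can be sketched as follows (a generic illustration of the technique, not the paper's exact implementation):

```python
import math
import random

def top_k_sample(logits, k, rng=None):
    """Sample one token id from the k highest-scoring logits, after
    restricting the softmax to that support."""
    rng = rng or random.Random(0)
    # indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i]) for i in top]  # unnormalized softmax terms
    return rng.choices(top, weights=weights)[0]   # choices renormalizes
```

Repeating this step, feeding each sampled token back into the decoder, produces the full caption.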

The results are summarized in Tables 2 and 3. Without any form of finetuning on the downstream dataset, VideoTranslate already achieves good accuracy. With generative finetuning, VideoTranslate outperforms all previous models on both YouCook2 and TVC, and it achieves performance approaching the best reported results on MSR-VTT.

Table 4 provides a few qualitative captioning examples.

YouCook2 MSR-VTT
Method Finetuning BLEU4 METEOR BLEU4 METEOR
VideoBERT Sun et al. (2019b) yes 4.3 11.9 - -
CBT Sun et al. (2019a) yes 5.1 13.0 - -
ORG Zhang et al. (2020) yes - - 43.6 28.8
VideoTranslate (ours) no 3.0 7.4 21.2 12.9
VideoTranslate (ours) yes: gen 5.3 13.4 41.7 28.5
Table 2: Captioning results on YouCook2 and MSR-VTT.
Method Subtitles BLEU4 METEOR
MMT Lei et al. (2020) yes 10.53 16.61
VideoTranslate (ours) no 9.70 15.31
VideoTranslate (ours) yes 11.26 16.97
Table 3: Captioning results on TVC. All models are finetuned.
Decoder out                      Gold
washing the the vegetables       still stirring vegetables
cut t ing with the the knife     take knife
put t ing the the kettle down    put down counter
put away the the sponge          place sponge away
Table 4: Examples of captions generated by VideoTranslate for EPIC-Kitchens videos (video input frames not shown).
YouCook2 MSR-VTT
Method Finetuning R@1 R@10 R@1 R@10
Miech et al. (2019) no 6.1 24.8 7.5 29.6
Miech et al. (2019) yes 8.2 35.3 14.9 52.8
Miech et al. (2020) no 15.1 51.2 9.9 32.4
VideoTranslate (ours) no 8.4 37.0 12.2 38.8
VideoTranslate (ours) yes: gen 11.6 43.9 14.7 52.8
Table 5: Text-based video retrieval on YouCook2 and MSR-VTT.

4.2.3 Text-based Retrieval

To assess the ability of our system to retrieve clips given text queries, we again use YouCook2 as well as MSR-VTT. We use the dataset splits and the exact same evaluation protocol defined in Miech et al. (2019), where performance is measured as recall@K (R@K). We reformulate retrieval as a generative task suitable for our model by evaluating the text queries under the probability distribution generated for each candidate video clip. This allows us to evaluate our model directly, without any form of finetuning. Table 5 compares our model against the methods presented in Miech et al. (2019) and Miech et al. (2020). We observe that VideoTranslate without any form of finetuning is already competitive, despite not being trained on these datasets or even this task. Conversely, we note that the training proposed in Miech et al. (2019, 2020) directly optimizes the models to discriminate between matching and non-matching text-video pairs, which is effectively the task considered here. Generative finetuning of VideoTranslate (in the form of captioning) on these downstream datasets further elevates the performance of our approach, despite not directly optimizing our model for the metric considered on these benchmarks.

4.2.4 Video Question Answering

We use the TVQA Lei et al. (2018) benchmark to assess question answering performance. TVQA is defined over the same set of videos as TVC, and contains several multiple-choice question-answer pairs per video. We use the validation sets defined by the authors to report performance. Note that, since VQA annotations are not available in HowTo100M, we introduce a new prompt denoting this task during finetuning on TVQA, passing the question as textual input to our model. In order to provide additional context to the model, we additionally experiment with concatenating the question with subtitles Petroni et al. (2020). Table 6 compares the results achieved by our approach against those of the state-of-the-art on this benchmark. Generative finetuning of VideoTranslate already yields better results than those reported in prior work. Note that we use beam search with beam size 5 for generation at test time.

Since each question involves a fixed set of answers, we also consider discriminative finetuning of our model, feeding both the question and each candidate answer as text input and training a linear layer on top of the resulting embedding obtained for the [CLS] token. As expected, this improves the results of VideoTranslate over those obtained with generative finetuning. Finally, we also consider fusing the predictions of the generative and the discriminative versions of our model. This is done by applying a softmax nonlinearity over the fixed set of answers to the scores obtained with the generative model. This yields a probability distribution over answers that can be averaged with that of the discriminative version of our model. As shown in Table 6, this fusion further elevates the accuracy, producing a large gain of 13.62% in top-1 accuracy on the test set compared to the best reported number in the literature for the setting involving no subtitles.
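The fusion step can be sketched in a few lines (a minimal sketch; the equal-weight average follows the description above):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(generative_scores, discriminative_probs):
    """Turn the generative model's per-answer scores into a probability
    distribution with a softmax, then average it with the discriminative
    model's per-answer probabilities."""
    gen_probs = softmax(generative_scores)
    return [(g + d) / 2 for g, d in zip(gen_probs, discriminative_probs)]
```

The fused scores remain a valid probability distribution over the candidate answers, and the answer with the highest fused probability is returned.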

Method Finetuning Subtitle as input Val Test*
TVQA Lei et al. (2018) yes no 45.03 -
BERT QA Yang et al. (2020) yes no 48.95 49.23
VideoTranslate (ours) yes:gen no 58.72 58.39
VideoTranslate (ours) yes:discr no 61.09 60.55
VideoTranslate (ours) yes: gen+discr no 63.01 62.84
TVQA Lei et al. (2018) yes yes 67.70 -
BERT QA Yang et al. (2020) yes yes 72.41 72.23
STAGE Lei et al. (2019) yes yes 70.50 -
VideoTranslate (ours) yes:gen yes 73.51 73.45
VideoTranslate (ours) yes:discr yes 75.19 75.01
VideoTranslate (ours) gen+discr yes: gen+discr yes 76.38 76.22
Table 6: Visual Question Answering results on the TVQA dataset.
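The score fusion described above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the equal-weight average is an assumption (the text only says the two distributions are averaged):

```python
import numpy as np

def fuse_answer_scores(gen_scores, discr_probs):
    """Fuse generative answer scores with discriminative probabilities.

    gen_scores: unnormalized scores (e.g., sequence log-likelihoods) that
    the generative model assigns to each candidate answer.
    discr_probs: probability distribution over the same answers produced
    by the discriminative head.
    """
    # Softmax over the fixed answer set turns the generative scores into
    # a probability distribution, as described in the text.
    e = np.exp(gen_scores - np.max(gen_scores))
    gen_probs = e / e.sum()
    # Simple average of the two distributions; argmax gives the answer.
    return 0.5 * (gen_probs + discr_probs)

# Five candidate answers for one question (illustrative values).
gen_scores = np.array([-3.2, -1.1, -4.0, -2.5, -0.9])
discr_probs = np.array([0.05, 0.30, 0.05, 0.10, 0.50])
fused = fuse_answer_scores(gen_scores, discr_probs)
```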

5 Conclusions

In this work we present an approach that formulates video understanding as machine translation. We argue that this provides several advantages. First, our fully-generative approach bypasses the problem of easy/hard negative sampling that mars contrastive learning methods. Second, it allows us to devote our entire encoder-decoder architecture to addressing multiple tasks simultaneously, without the need to design task-specific heads. Finally, casting video understanding as open-ended text generation enables strong transfer performance even without finetuning. Software and models described in this work will be made available upon publication.

Broader Impact

The broader impact of this work falls predominantly in the application areas of video understanding systems, such as action recognition, multimedia search, and human-computer interaction. The authors do not foresee major ethical issues associated with this work. As with most machine learning systems, our approach is susceptible to biases present in the distribution of the data. This is particularly true for self-supervised approaches such as ours. Since most videos used in our work originate in English-speaking, Western regions of the world, we anticipate our learned representations to be most effective for applications in these geographic areas. However, as future work we plan to use an internationalized version of the training set Sigurdsson et al. (2020), which should help assuage such biases.

Acknowledgments

The authors would like to thank Patrick Lewis for discussions, and Shubho Sengupta for help with the infrastructure and debugging.


  • [1] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2.
  • [2] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In ICCV, Cited by: §1, §2.
  • [3] R. Arandjelovic and A. Zisserman (2018) Objects that sound. In ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Cited by: §2.
  • [4] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel (2020) SpeedNet: learning the speediness in videos. In CVPR, Cited by: §1, §2.
  • [5] J. Carreira and A. Zisserman (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, Cited by: §2.
  • [6] G. Chechik, V. Sharma, U. Shalit, and S. Bengio (2010) Large scale online learning of image similarity through ranking. JMLR. Cited by: §1.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §2.
  • [8] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In European Conference on Computer Vision (ECCV), Cited by: Appendix B, §4.2.1.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.
  • [10] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR, Cited by: §2.
  • [11] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, Cited by: §C.2, §C.2, Table 8, §2, §3.1, Table 1.
  • [12] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic (2017) The "something something" video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5843–5851. Cited by: §1.
  • [13] C. Gu, C. Sun, D. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, Cited by: §1.
  • [14] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §1, §2.
  • [16] D. Jayaraman and K. Grauman (2015) Learning image representations tied to ego-motion. In ICCV, Cited by: §2.
  • [17] L. Jing, X. Yang, J. Liu, and Y. Tian (2018) Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv preprint arXiv:1811.11387. Cited by: §2.
  • [18] V. Karpukhin, B. Oguz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. External Links: Link Cited by: §2.
  • [19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1.
  • [20] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, Cited by: §1, §1, §2.
  • [21] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV, Cited by: §1.
  • [22] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, Cited by: §2.
  • [23] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: localized, compositional video question answering. In EMNLP, Cited by: §4.2.4, Table 6.
  • [24] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2019) TVQA+: spatio-temporal grounding for video question answering. External Links: 1904.11574 Cited by: Table 6.
  • [25] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. External Links: 2001.09099 Cited by: §4.2.2, Table 3.
  • [26] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. External Links: Link Cited by: §2.
  • [27] T. Li and L. Wang (2020) Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691. Cited by: §2.
  • [28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2, §2.
  • [29] L. Massarelli, F. Petroni, A. Piktus, M. Ott, T. Rocktäschel, V. Plachouras, F. Silvestri, and S. Riedel (2019) How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587. Cited by: §3.1.
  • [30] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In CVPR, Cited by: §1, §2, §4.2.3, Table 5.
  • [31] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, Cited by: Appendix B, §C.4, §1, §1, §2, §3.1, §4.2.3, Table 5.
  • [32] I. Misra and L. van der Maaten (2020) Self-supervised learning of pretext-invariant representations. In CVPR, Cited by: §2.
  • [33] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, Cited by: §1, §2.
  • [34] J. Y. Ng, J. Choi, J. Neumann, and L. S. Davis (2018) Actionflownet: learning motion representation for action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1616–1624. Cited by: §1.
  • [35] R. Nogueira, Z. Jiang, and J. Lin (2020) Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713. Cited by: §3.1.
  • [36] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1.
  • [37] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In ECCV, Cited by: §1, §2.
  • [38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §2.
  • [39] F. Petroni, P. Lewis, A. Piktus, T. Rocktäschel, Y. Wu, A. H. Miller, and S. Riedel (2020) How context affects language models’ factual predictions. arXiv preprint arXiv:2005.04611. Cited by: §4.2.4.
  • [40] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.
  • [41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints. External Links: 1910.10683, Link Cited by: §A.2, Appendix B, §C.1, §2, §2, §2, §3.1, §4.1.
  • [42] C. Schuldt, I. Laptev, and B. Caputo (2004) Recognizing human actions: A local svm approach. In ICPR, Cited by: §1.
  • [43] G. A. Sigurdsson, J. Alayrac, A. Nematzadeh, L. Smaira, M. Malinowski, J. Carreira, P. Blunsom, and A. Zisserman (2020) Visual grounding in video for unsupervised word translation. In CVPR, Cited by: Broader Impact.
  • [44] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, Cited by: §1.
  • [45] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01. Cited by: §1.
  • [46] C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §1, §1, §2, Table 2.
  • [47] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: A joint model for video and language representation learning. In ICCV, Cited by: §1, §1, §2, Table 2.
  • [48] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.
  • [49] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: Appendix B, §C.2, §2, §3.1.
  • [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §3.1.
  • [51] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV, Cited by: §2.
  • [52] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275. External Links: Link Cited by: §2.
  • [53] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pp. 353–355. External Links: Link Cited by: §2.
  • [54] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, Cited by: §2.
  • [55] W. Wang, D. Tran, and M. Feiszli (2019) What makes training multi-modal networks hard?. CoRR abs/1905.12681. External Links: Link Cited by: Table 8, Table 1.
  • [56] X. Wang, Y. Wu, L. Zhu, and Y. Yang (2019) Baidu-uts submission to the epic-kitchens action recognition challenge 2019. CoRR abs/1906.09383. External Links: Link Cited by: Table 8, Table 1.
  • [57] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In CVPR, Cited by: §1, §2.
  • [58] L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2019) Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814. Cited by: §2.
  • [59] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1.
  • [60] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, Cited by: §2.
  • [61] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 5288–5296. Cited by: §4.2.2.
  • [62] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura (2020) BERT representations for video question answering. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pp. 1545–1554. Cited by: Table 6.
  • [63] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha (2020) Object relational graph with teacher-recommended learning for video captioning. CoRR abs/2002.11566. External Links: Link Cited by: Table 2.
  • [64] L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 7590–7598. Cited by: Appendix B, §4.2.2.


Appendix A Input pre-processing

In this section, we discuss the pre-processing details for video and text.

A.1 Video pre-processing

Our video network takes as input RGB frames of size . We apply standard data augmentation transformations (multi-scale random crop, random horizontal flip, and Z normalization) to all input videos at training time. At inference time, we do not flip the videos and use a centre crop as opposed to the multi-scale random crop.
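The train/test transform split above can be sketched as follows. This is a simplified NumPy sketch (a single-scale random crop stands in for the multi-scale crop; function names are ours):

```python
import numpy as np

def center_crop(frame, size):
    """Deterministic center crop, used at inference time."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def train_transform(frame, size, mean, std, rng=np.random):
    """Training-time augmentation sketch: random crop (single-scale here,
    simplifying the multi-scale crop), random horizontal flip, then
    Z normalization (subtract mean, divide by std)."""
    h, w = frame.shape[:2]
    top = rng.randint(0, h - size + 1)
    left = rng.randint(0, w - size + 1)
    crop = frame[top:top + size, left:left + size]
    if rng.rand() < 0.5:
        crop = crop[:, ::-1]  # horizontal flip
    return (crop - mean) / std
```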

Since the datasets considered in our experiments are encoded at different frames-per-second (FPS) and have varying lengths, we adopt dataset-specific strategies to form the video input.

HowTo100M, YouCook2, MSR-VTT:

We do not apply any normalization to the videos themselves. Before loading an annotated segment into memory, we check whether it is longer than 6 seconds; if so, we randomly select a 6-second window within the segment to prevent out-of-memory errors during decoding.
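The 6-second capping can be sketched as follows (a minimal sketch with our own function name):

```python
import random

def cap_segment(start, end, max_len=6.0, rng=random):
    """If an annotated segment exceeds max_len seconds, pick a random
    max_len-second window inside it; otherwise return it unchanged.
    Times are in seconds."""
    if end - start <= max_len:
        return start, end
    new_start = start + rng.uniform(0.0, (end - start) - max_len)
    return new_start, new_start + max_len
```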

TVC, TVQA:

These datasets share the same videos, provided as a series of frames extracted at 3 frames per second. In order to form the video input, we load the frames of a sequence into memory, repeating each frame twice (making the frame rate effectively 6 FPS).


EPIC-Kitchens:

This dataset is given as a series of high-resolution (up to pixels on the shorter side), high frame-rate (up to 90 FPS) GoPro videos. In order to reduce the decoding cost, we re-sample each video to 24 FPS at the lowest common resolution of pixels.

A.2 Text pre-processing

We tokenize the text input following exactly the procedure from Raffel et al. (2019). We prepend the <CLS> token to the input and append the <SEP> token to the end of the masked text input, before any additional context if one is used. Note that we do not remove stop words from the HowTo100M annotations.
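The input assembly can be sketched as follows (a minimal sketch; the function name and list-of-tokens representation are our assumptions):

```python
def build_text_input(masked_tokens, context_tokens=None,
                     cls="<CLS>", sep="<SEP>"):
    """Assemble the text input: <CLS> + masked text + <SEP>, followed by
    any additional context (e.g., subtitles), if provided."""
    tokens = [cls] + list(masked_tokens) + [sep]
    if context_tokens:
        tokens += list(context_tokens)
    return tokens
```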

Appendix B Study of design choices

Here we report the results of an empirical study aimed at determining the effects of different design choices on downstream performance. In order to reduce the computational cost of this broad evaluation, we perform training using the smaller t5-small transformer on a subset of HowTo100M consisting of 200k randomly sampled videos. All models are trained for 240k iterations. We measure effects on the downstream tasks of EPIC Kitchens Damen et al. (2018) (EK) classification (top-1 accuracy %) and YouCook2 Zhou et al. (2018) retrieval (R@10 %), without any fine-tuning on these benchmarks (i.e., in a zero-shot setting). Table 7 provides the quantitative results of this study which we summarize next.

Video input. We experimented with 32-frame inputs obtained by either (a) sampling the frames with uniform spacing so as to span the entire clip or (b) sampling 32 consecutive frames at 16 FPS from a random starting time within the clip. The former option yields better results (a gain of 1.1% in accuracy on EK and of 3.4% in R@10 on YouCook2).
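The two sampling strategies can be sketched as follows (a minimal sketch; the 30 FPS clip rate in strategy (b) is an illustrative assumption):

```python
import numpy as np

def uniform_indices(num_frames, out_frames=32):
    """Strategy (a): out_frames indices uniformly spaced over the clip."""
    return np.linspace(0, num_frames - 1, out_frames).round().astype(int)

def consecutive_indices(num_frames, out_frames=32, clip_fps=30,
                        sample_fps=16, rng=np.random):
    """Strategy (b): out_frames consecutive frames read at sample_fps,
    starting from a random point in the clip."""
    step = max(1, round(clip_fps / sample_fps))
    span = (out_frames - 1) * step
    start = rng.randint(0, max(1, num_frames - span))
    return start + step * np.arange(out_frames)
```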

Video feature pooling. We found that applying temporal pooling to the video features degrades downstream performance by 2.5% on EK and 1.4% on YouCook2.

Input text perturbations. As mentioned, training a model to generate text from video alone is a very hard task. Conversely, simply providing the associated narration as input reduces target generation to an identity mapping. Thus, during training we present the model with perturbed text. Following Raffel et al. (2019), we evaluate masking and shuffling as text perturbations. Training with video features only yields low performance (0.9% and 2.3% on EK and YouCook2, respectively), and adding unmodified text significantly eases the task but does not improve performance on the downstream tasks (1.3% and 8.9%). Training with perturbed text gives a boost in downstream performance compared to training with full text or no text at all: shuffling the input text yields 8.6% and 11.4% on EK and YouCook2, and masking it further boosts the results to 11.9% and 17.2%. Finally, we found that training with both forms of text perturbation yields the best downstream performance (14.3% and 19.3%).
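The two perturbations can be sketched as follows (a minimal sketch: the mask token name and masking probability are illustrative; the actual procedure follows Raffel et al. (2019)):

```python
import random

MASK = "<mask>"  # placeholder mask token, illustrative only

def perturb_text(tokens, mask_prob=0.15, shuffle=True, rng=random):
    """Shuffle and/or mask the narration tokens so that reproducing the
    target narration is no longer an identity mapping."""
    out = list(tokens)
    if shuffle:
        rng.shuffle(out)  # in-place random reordering
    return [MASK if rng.random() < mask_prob else t for t in out]
```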

Multi-task training. We found that training our model to perform both captioning and classification on HowTo100M Miech et al. (2019) results in better downstream accuracy compared to single-task training (+2.2% on EK and +8.1% on YouCook2).

Training with frozen video embeddings. Due to the computational costs of end-to-end training of video models, most approaches use fixed video or image features as an input. We found that training the video encoder jointly with the transformer is crucial in our model. Training the transformer with a frozen R(2+1)D Tran et al. (2018) model fails to converge and produces consistently worse performance on all tasks (0.5% on EK and 3.2% on YouCook2).

Transformer size. We found that performance on the downstream tasks improves as we increase the number of transformer parameters. Due to memory constraints, the largest transformer we use is a t5-large model with 770M parameters, which outperforms the identical setup with the t5-base model (220M parameters) on every task.

Video input | YouCook2 R@10 | EK top-1
32 frames uniformly spaced | 19.3 | 14.3
32 consecutive frames | 15.9 | 13.2
(a) Video input.

Temporal pooling | YouCook2 R@10 | EK top-1
Yes | 17.9 | 11.8
No | 19.3 | 14.3
(b) Video feature pooling.

Text input | YouCook2 R@10 | EK top-1
None | 2.3 | 0.9
Unmodified | 8.9 | 1.3
Shuffled | 11.4 | 8.6
Masked | 17.2 | 11.9
Shuffled & Masked | 19.3 | 14.3
(c) Text perturbation ablation.

Training task | YouCook2 R@10 | EK top-1
Captioning | 9.4 | 9.8
Classification | 11.2 | 12.1
Multi-task | 19.3 | 14.3
(d) Training task.

Video embedding | YouCook2 R@10 | EK top-1
Fixed | 3.2 | 0.5
Finetune last layer only | 18.2 | 11.9
Finetune all | 19.3 | 14.3
(e) Video embedding.

Size (params) | YouCook2 R@10 | EK top-1
Small (60M) | 19.3 | 14.3
Base (220M) | 19.5 | 15.1
Large (770M) | 19.9 | 16.7
(f) Transformer size.

Table 7: Effects of design choices on downstream tasks (YouCook2 R@10 %, EPIC-Kitchens top-1 accuracy %).

Appendix C Additional experimental details

Here we present task-specific hyper-parameters, details about dataset splits, and additional training information. In general, all learning rates (LR) are scaled according to the total number of GPUs in a distributed training run.
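The learning-rate handling above can be sketched as follows. This is a minimal sketch: the linear scaling rule is our assumption (the text only says LRs are scaled with the number of GPUs), and the warm-up is the linear ramp mentioned in the following subsections:

```python
def scaled_lr(base_lr, num_gpus, base_gpus=1):
    """Scale the base LR with the total number of GPUs in the run
    (assuming a linear scaling rule)."""
    return base_lr * num_gpus / base_gpus

def warmup_lr(step, warmup_steps, target_lr):
    """Linear warm-up: ramp the LR from 0 to target_lr over
    warmup_steps iterations, then hold it constant."""
    return target_lr * min(1.0, step / warmup_steps)
```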

C.1 General hyper-parameters

For our discriminative finetuning, we apply LR decay at given intervals. We use the equivalent of 1 epoch of iterations as a linear warm-up wherever we use multiple nodes for finetuning.

For generative finetuning we largely follow the procedure in Raffel et al. (2019), i.e., we finetune the entire encoder-decoder model for steps with a constant learning rate. The only deviations from the procedure described in Raffel et al. (2019) are: 1) using iterations as a linear warm-up and 2) setting the finetuning learning rate to .

C.2 Action classification - EPIC-Kitchens

Dataset splits.

We optimize our hyper-parameters on the same validation set of unseen kitchens used in Ghadiyaram et al. (2019): videos by persons 01 to 25 ( segments) are used for training, and the remaining ones are used for validation ().

Finetuning parameters - baseline model.

For the finetuning of the baseline R(2+1)D-34 Tran et al. (2018) model we use the same hyper-parameters as in Ghadiyaram et al. (2019): the total training amounts to 27 epochs with a batch size of 6 clips per GPU. The base learning rate of is decayed every 9 epochs. We found that using 1 epoch for a linear warm-up helps with the consistency of the experiments.

Finetuning parameters - VideoTranslate.

For discriminative finetuning, we use a LR of for the video feature extractor and the transformer encoder, and a LR of for the linear layers trained on top of the representations. All layers are trained for iterations, and the learning rates are scaled down every iterations.

Top-5 Accuracy Results

In Table 8 we present the EK recognition results in terms of top-5 accuracy. All results are computed as an average of predictions on 10 uniformly sampled clips from every video.

Validation Test-S1 Test-S2
Method Finetuning Verb Noun Action Verb Noun Action Verb Noun Action
T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5 T1 T5
R(2+1)D-34 yes 56.0 80.2 34.8 58.0 24.9 41.9 63.2 87.4 46.0 69.6 34.1 54.0 55.1 80.5 33.2 56.1 23.5 38.7
R(2+1)D-152 Ghadiyaram et al. (2019) yes 57.3 81.1 35.7 58.7 25.6 42.7 65.2 87.4 45.1 67.8 34.5 53.8 58.4 84.1 36.9 60.3 26.1 42.7
GBlend Wang et al. (2019c) yes 59.2 84.5 36.1 58.5 25.6 43.5 66.7 88.9 48.5 71.7 37.1 56.2 58.3 81.3 36.7 60.3 26.6 43.6
BAIDU Wang et al. (2019d) yes 63.2 84.6 39.1 65.0 29.0 49.8 69.8 91.0 53.3 76.7 41.4 63.6 59.7 82.7 34.2 62.4 25.1 46.0
Ours - 0shot no 53.3 75.5 34.0 58.0 24.6 41.8 57.8 78.5 35.1 59.8 24.9 42.6 52.6 74.2 34.1 58.7 24.8 42.0
Ours yes:gen 56.3 79.6 35.6 58.5 25.7 42.8 63.8 86.9 44.0 67.1 34.0 53.0 57.2 82.6 36.2 59.1 26.3 42.8
Ours yes: gen-MT 58.0 82.6 36.5 60.1 26.0 43.7 65.4 87.9 46.0 69.3 37.0 54.0 58.5 83.7 36.9 59.9 26.8 43.6
Ours yes:disc 59.4 83.9 37.8 59.0 27.8 45.1 66.1 88.5 48.4 69.9 37.2 55.8 58.6 84.1 37.0 60.5 26.9 44.5
Table 8: Action classification: comparison to the state-of-the-art on EPIC-Kitchens.
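The 10-clip averaging used for the numbers above can be sketched as follows (a minimal NumPy sketch with our own function name):

```python
import numpy as np

def video_prediction(clip_logits):
    """Average per-clip softmax distributions over the clips sampled
    uniformly from a video; the argmax gives the video-level label.

    clip_logits: array of shape (num_clips, num_classes)."""
    e = np.exp(clip_logits - clip_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)

# Two clips, three classes (illustrative values).
logits = np.array([[2.0, 0.5, 0.1],
                   [1.5, 1.0, 0.2]])
video_probs = video_prediction(logits)
```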

C.3 Captioning - TVC/YouCook2/MSR-VTT

We finetune our model for each dataset separately. We use the same general hyper-parameters for each of the datasets (they are all fine-tuned in a generative fashion).

C.4 Retrieval - YouCook2/MSR-VTT

We fine-tune these models for captioning following the general procedure above, but on the dataset splits provided by Miech et al. (2019).