1 Introduction
Joint vision-and-language understanding sits at the nexus of computer vision and natural language processing (NLP), and has attracted rapidly growing attention from both communities. Popular tasks include visual question answering [4, 20], referring expression comprehension [69, 68], visual dialog, visual reasoning [27, 52, 25], visual commonsense reasoning, NLVR, and visual entailment. The emergence of these diverse Vision+Language tasks, benchmarked over large-scale human-annotated datasets [39, 34], has driven tremendous progress in joint multimodal embedding learning [53, 42, 10, 51]. However, most of these datasets and models are centered on static images, leaving the joint modeling of video and its aligned textual information (i.e., video-and-language understanding) a relatively under-explored territory.
Video Question Answering (Video QA) is one of the most popular tasks in current studies of video-and-language understanding. A Video QA model aims to answer a natural language question given a video clip. Existing Video QA datasets include MovieFIB, MovieQA, TGIF-QA, PororoQA, and TVQA [35, 36]. While these datasets cover a rich pool of video content (e.g., cartoons, short GIFs, and TV shows), they are limited to the QA task only. On the other hand, in the NLP field, one important benchmark for natural language understanding is natural language inference (NLI) [5, 60], where a model is presented with a pair of sentences (premise and hypothesis) and judges the relationship between the pair (e.g., Contradiction, Neutral, and Entailment).
Inspired by NLI, we present a novel task, Video-and-Language Inference, to foster deeper investigation of video-and-language understanding. Specifically, given a video clip with aligned subtitles as premise, and a natural language statement as hypothesis describing the video content, a model is expected to infer whether the statement is entailed or contradicted by the given video clip. This new task is easy to evaluate, since only binary classification accuracy is measured, yet challenging to solve, as a thorough interpretation of both visual and textual clues is required to achieve in-depth understanding and inference over a complex video scenario.
We introduce a large-scale dataset for this new task, VIdeO-and-Language INference (Violin; project page: https://github.com/jimmy646/violin), built upon natural video content with rich temporal dynamics and social interactions. Video clips are collected from diverse sources to cover realistic visual scenes, and statements are collected from crowdsource workers via Amazon Mechanical Turk (AMT; https://www.mturk.com/), who watched the videos accompanied by subtitles (dialogue, scene description, etc.). Our goal is to provide a dataset that can test a model's cross-modality reasoning skills over both video and textual signals. To this end, we require AMT workers to write statements based on a joint understanding of both video and subtitles, which not only describe explicit information in the video (e.g., objects, locations, characters, social activity), but also reveal in-depth comprehension of complex plots (e.g., interpreting human emotions and relations, understanding events, inferring causal relations of events throughout the video). This distinguishes our collected statements from the straightforward captions in video/image captioning datasets [39, 33, 59], which are dominated by explicit factual descriptions without deeper inference.
Writing negative statements for an inference task is challenging [5, 72]. To gather high-quality negative statements without artificial cues or biased priors, we employed two strategies in the data collection: (1) requiring annotators to write negative statements by changing just a few words or phrases in a positive statement, to ensure that the style and length of the statement remain unchanged; (2) performing adversarial matching: for each video, selecting challenging and confusing statements from the statement pool of other videos as the negative ones. The first strategy ensures the collected statements can test a model's in-depth inference ability: since only a small fraction of a positive statement is modified, the model must distinguish highly similar statements with different meanings. The second strategy focuses more on testing a model's global understanding of the video, requiring it to distinguish statements with high-level scene differences between videos. Combined, these two strategies produce a dataset with minimal visual or textual bias. Through this effort, we collected 95,322 video-statement pairs, containing 15,887 video clips spanning 582 hours. Each video is paired with 6 statements and is 35.2 seconds long on average.
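The second strategy amounts to a nearest-neighbor search over a pool of positive statements from other videos. The sketch below uses a deliberately simple bag-of-words cosine similarity as a hypothetical stand-in for the stronger relevance model used in adversarial matching; both helper names are our own:

```python
from collections import Counter

def token_overlap(a: str, b: str) -> float:
    # Crude similarity: cosine over bag-of-words counts. This is an
    # illustrative stand-in, not the relevance model used in the paper.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in ca)
    den = (sum(v * v for v in ca.values()) ** 0.5) * \
          (sum(v * v for v in cb.values()) ** 0.5)
    return num / den if den else 0.0

def adversarial_negative(target_stmt: str, other_video_stmts: list) -> str:
    # Pick the most similar positive statement written for OTHER videos,
    # and reuse it as a confusing negative for the target video.
    return max(other_video_stmts, key=lambda s: token_overlap(target_stmt, s))
```

In the real pipeline the pool excludes statements from the same video, so a selected negative is guaranteed to describe a different scene.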
The main contributions of this paper are three-fold. (1) We propose a new task, Video-and-Language Inference, which requires a model to infer whether a written statement entails or contradicts a given video clip. (2) We introduce a new dataset, Violin, for this task, providing a reliable benchmark for measuring joint video-and-language understanding models. (3) We provide a detailed analysis of the Violin dataset with evaluation over strong baselines, and suggest future directions for this new task.
| Source | # episodes | # clips | Avg clip len | Avg pos. statement len | Avg neg. statement len | Avg subtitle len |
| --- | --- | --- | --- | --- | --- | --- |
| How I Met Your Mother | 207 | 1,944 | 31.64s | 18.08 | 18.06 | 76.78 |
2 Related Work
Natural Language Inference (NLI)
Understanding entailment and contradiction relations between sentences (i.e., Natural Language Inference) is fundamental to natural language understanding. Several large-scale datasets have been developed as NLI benchmarks, such as SNLI and MultiNLI. NLI is also included in the GLUE benchmark for evaluating general language understanding. The recent introduction of large-scale pre-trained language models, such as BERT, XLNet, and RoBERTa, has propelled significant progress in NLI. Multi-task learning and adversarial training [40, 73] have also proven helpful in improving model performance.
Inspired by NLI, we propose the task of Video-and-Language Inference to evaluate a system’s multimodal reasoning ability. However, different from NLI, our task is more challenging in the sense that both video and text (subtitles) are provided; thus, a thorough joint understanding of both modalities is required for inference.
Visual Entailment Visual Entailment (VE) is a recently proposed task that extends NLI to the visual domain. In this task, a natural image premise and a natural language hypothesis are given, and the goal is to judge whether the textual hypothesis can be confirmed based on the visual content of the image. Three labels are assigned: Entailment, Neutral, and Contradiction. The dataset is created based on Flickr30k image captions and SNLI. Similarly, NLVR is proposed to investigate the grounding relationship between given images and a natural language description.
Our proposed task differs from VE in the following aspects. (1) VE considers images as input, while our task focuses on videos. Compared with static images, videos contain complex temporal dynamics, making video-and-language inference more challenging, as the model needs to understand the relationships between different visual scenes to draw inference. (2) Our proposed task requires deeper visual understanding. Images in the VE task are mostly natural images, while the videos in Violin were collected from popular TV shows and movie clips, which contain rich social interactions and diverse scenes. This requires a model to not only understand explicit visual cues, but also infer the in-depth rationale behind a scene. (3) Our task requires more sophisticated language understanding. VE is a combination of Flickr30k and SNLI, with no crowdsourcing involved; its hypotheses are captions only, containing factual descriptions that can be explicitly derived from the visual content of the image. In contrast, Violin mainly consists of implicit statements that cannot be solved without an in-depth understanding of the video and text, designed specifically to evaluate a model's multimodal reasoning skills.
Video-and-Language Research With the emergence of large-scale video datasets [6, 1, 29, 11, 58], several video-and-language tasks have been proposed, such as video captioning [21, 56, 62, 18, 33, 16, 47, 59], localizing video segments from natural language queries [19, 3, 8, 37], video reasoning, and video question answering [54, 35]. Video captioning is a conditional text generation task, while the other three belong to video-and-language understanding. In particular, MovieQA, TGIF-QA, and TVQA [35, 36], which contain real-world videos and human-generated questions, were recently proposed for video question answering.
Our Violin dataset also uses TV shows as one of its video sources, similar to TVQA. The main differences are: (1) our dataset contains richer video content, including 5,885 movie clips in addition to the TV shows used in TVQA; (2) our dataset requires more sophisticated reasoning skills from a model, such as inferring reasons and interpreting human emotions, while most QA pairs in TVQA focus on identifying explicit information.
Visual Question Answering Our proposed task is also related to Visual Question Answering (VQA) [4, 20]. The CLEVR dataset serves as a popular synthetic diagnostic dataset that tests a model's compositional reasoning skills. More recently, GQA was introduced to benchmark real-world visual reasoning, and VCR for visual commonsense reasoning.
Many neural network models have been proposed for these tasks, such as more advanced attention mechanisms [64, 43, 70], better multimodal fusion methods [15, 71, 31, 30], multi-step reasoning [24, 17, 7], the incorporation of relations [49, 38, 45], and neural module networks for compositional reasoning [2, 28, 23, 9]. Our proposed task provides a new perspective for benchmarking these models.
3 Video-and-Language Inference Dataset
| Dataset | Visual Domain | Source | Subtitles | Inference | Task | # images/videos | # samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TVQA | video | TV show | ✓ | ✗ | QA | 21.8K | 152.5K |
| Violin (ours) | video | TV show/movie | ✓ | ✓ | Entailment | 15.9K | 95.3K |
In our Violin dataset for video-and-language inference, the input is a video clip V consisting of a sequence of video frames {v_1, ..., v_T}, paired with its aligned text S = {s_1, ..., s_n} (where each s_i is the subtitle within a time span in the video) and a natural language statement H as the hypothesis aiming to describe the video clip. For every (V, S, H) triplet, a system needs to perform binary classification: f(V, S, H) -> {0, 1}, deciding whether the statement is entailed (label 1) by or contradicts (label 0) the given video clip. In order to increase coverage and versatility, we collect the videos from diverse sources, including 4 popular TV shows of different genres and YouTube movie clips from thousands of movies. To ensure high video quality, we also provide carefully-designed protocols to guide crowdsource workers to select representative video segments for which to write positive/negative statements. The procedure of dataset collection is detailed in Sec. 3.1, and Sec. 3.2 provides a thorough analysis of the dataset.
3.1 Dataset Collection
We collect videos from two sources: (1) 4 popular TV shows, and (2) movie clips from YouTube channels (https://www.youtube.com/user/movieclips) covering thousands of movies. Both sources contain rich human interactions and activities. Each episode of the TV shows is 20-40 minutes long, which we split into 90-second clips (while avoiding splitting dialogues in the middle). These 90-second clips may contain more than one scene; they are then presented to crowdworkers, who select a video segment containing a single, self-contained scene for which they can write statements. Additionally, we restrict the length of the selected interval to 15-40 seconds, to maintain a reasonable difficulty level for the task. The movie clips from YouTube channels are originally around two minutes long and by nature usually contain only one scene of the movie, so there is no need for workers to manually select a segment; we simply take the first 40 seconds of every movie clip for annotation, to keep it consistent with the TV show clips. Figure 2 shows the interface for AMT workers. By dragging the slider below the video player, users can adjust the start and end timestamps of the segment they want to select (for movie clips the slider is disabled).
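The episode-splitting step above (cutting into roughly 90-second clips without breaking dialogue) can be sketched as snapping each cut point to the nearest gap between subtitle lines. The helper below, including its 15-second minimum-clip guard, is our illustrative assumption, not the paper's actual tooling:

```python
def split_episode(duration_s, subtitle_spans, target_len=90):
    """Split an episode into ~target_len-second clips, snapping each cut
    to the midpoint of a gap between subtitle lines so that no dialogue
    line is cut in the middle.

    subtitle_spans: sorted list of (start, end) times of subtitle lines.
    Returns a list of (clip_start, clip_end) tuples covering the episode.
    """
    # Candidate cut points: midpoints of gaps between consecutive lines.
    gaps = [(subtitle_spans[i][1] + subtitle_spans[i + 1][0]) / 2
            for i in range(len(subtitle_spans) - 1)]
    boundaries = [0.0]
    t = target_len
    while t < duration_s:
        # Only consider cuts that keep every clip at least ~15s long
        # (a guard we added for this sketch).
        candidates = [g for g in gaps if g > boundaries[-1] + 15]
        cut = min(candidates, key=lambda g: abs(g - t)) if candidates else t
        boundaries.append(cut)
        t = cut + target_len
    boundaries.append(duration_s)
    return list(zip(boundaries[:-1], boundaries[1:]))
```

Crowdworkers then pick a 15-40 second single-scene segment from within each resulting clip.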
After video segments are selected, they are presented to another group of annotators to write positive/negative statements. Each worker is assigned one video clip, and is required to write three pairs of positive/negative statements describing the video (in the text boxes in Figure 2). We do not require AMT workers to follow any templates, as our goal is to collect diversified and natural expressions. We do, however, provide several guidelines for writing positive statements: (1) We do not allow annotators to refer to characters in the video by name; instead, they should use grounded referring expressions (e.g., "the man with blonde hair wearing a grey shirt", "the girl sitting on the sofa holding a cup of coffee"). The purpose is to keep the dataset consistent across different video sources (not all video clips have character names), and to reduce potential bias (in TV shows, the number of character names is very small). (2) We ask workers to keep copying from subtitles (e.g., "somebody says ...") or describing explicit visual information (e.g., object, color) to a minimum, and encourage them to write statements combining information from both the video clip and the subtitles. (3) We encourage workers to write about different aspects of the given video clip in different statement pairs, which may require different types of reasoning, such as inferring character emotions/relations/intentions and inferring causal relations in complex events.
In practice, we observe that when human annotators write negative statements without any constraint, the resulting statements show serious bias (i.e., models can learn to classify positive/negative statements without even absorbing information from the video or subtitles). When intentionally writing fake content without any reference, humans tend to use subtle patterns that statistical models can easily pick up. Therefore, when collecting negative statements, we propose two strategies to alleviate this bias issue. First, we ask annotators to use a positive statement as reference, and modify only a small portion of it to make it negative. In this case, most of the statement remains true to the video content, and human-introduced bias is kept to a minimum. This rigorous setting makes the statements more challenging for a model to distinguish, and in-depth reasoning is required to identify the fake content. For quality control, only workers located in English-speaking countries with a lifetime task approval rate greater than 98% can participate in our study. Also, during data collection, we manually check every worker's submissions to ensure the quality of the video segments and statements.
VCR proposes adversarial matching to construct wrong answers for multiple-choice QA, by selecting a correct answer (from another question) that is most similar to the current question. In our task, we use a similar strategy: for a human-generated positive statement p collected for video A, we select the positive statement q collected for another video B that is most similar to p, and use (p, q) as a pair of positive/negative statements for video A. Using this strategy, a portion of the collected statements serve as both positive and negative samples, which helps remove artificial bias. Unlike the first strategy mentioned above, statement pairs constructed this way focus more on global understanding of the video. For example, in Figure 1, the first two negative statements are written by modifying positive statements (the modified part is marked in red), and the third negative statement is obtained by adversarial matching. In the final dataset, a portion of the negative statements are constructed following the first strategy, and the remaining ones with the second strategy.
3.2 Dataset Analysis
The Violin dataset contains 15,887 video clips, and each video clip is annotated with 3 pairs of positive/negative statements, resulting in 95,322 (video, statement) triplets in total. Statistics on the full dataset are provided in Table 1. Each statement has 18 words on average, and the lengths of positive and negative statements are almost the same, showing no significant bias in length.
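The reported totals are internally consistent: three positive/negative pairs per clip means six statements per clip, and:

```python
n_clips = 15_887
stmts_per_clip = 6  # 3 positive/negative pairs per clip
total_pairs = n_clips * stmts_per_clip
# Matches the 95,322 video-statement pairs reported in the paper.
assert total_pairs == 95_322
```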
As discussed in Sec. 3.1, we use two strategies to collect negative statements: one is adversarial matching, which tests a model's ability at global video understanding; the other is modifying a small part of a positive statement for the video clip, which requires in-depth reasoning skills for a model to distinguish between positive and negative statements. To investigate in more detail, we categorize each pair of positive and negative statements into one of 6 types of reasoning skills required, as shown in Figure 3. The types "visual recognition", "identifying character", and "action recognition" focus on explicit information and require relatively low-level reasoning. "Human dynamics" includes inferring human emotions/relations/intentions; "conversation reasoning" requires performing inference over characters' dialogues and other forms of interaction (body language, hand gestures, etc.); and "inferring reasons" is about inferring causal relations in complex events. These three types of statements require in-depth understanding and commonsense reasoning. Overall, explicit-information recognition makes up 54% of the dataset and commonsense reasoning the remaining 46%, making our dataset a balanced one that imposes new challenges on multi-facet video-and-language understanding. Compared to other datasets, Violin is more focused on reasoning than on surface-level grounding (e.g., in TVQA, only 8.5% of the questions require reasoning).
4 Model
In this section, we introduce the baseline model used for benchmarking the Violin dataset and evaluating the effectiveness of different feature choices. An overview of the model is illustrated in Figure 4.
4.1 Video and Text Encoders
We first extract a sequence of visual features from the video frames as V = {v_1, ..., v_T}, v_t ∈ R^{d_v}, where T is the number of time steps and d_v is the dimension of each feature. Choices of visual features will later be discussed in Sec. 5.1. The video encoder is implemented as a bi-directional LSTM, to capture the temporal correlation among consecutive frames. By passing the video features into the video encoder and stacking hidden states from both directions, we obtain the video representations H^v ∈ R^{T×2d}, where d is the hidden-state dimension of the LSTM encoder.
Statements and subtitles share the same text encoder. Statements are tokenized into a word sequence {w_1, ..., w_n}. Each line in the subtitle is tokenized, and all the lines are concatenated together into one single word sequence {u_1, ..., u_m}. Here, n and m are the lengths of the statement and subtitle, respectively. We experiment with two types of text encoder: an LSTM encoder and a BERT encoder. For the LSTM encoder, every word token is converted to its word embedding and then fed to the LSTM, producing text representations H^stmt and H^sub. For the BERT encoder, we use a pre-trained BERT-base model, finetuned on Violin training statements and subtitles. The output of the BERT encoder at each position is 768-dimensional, which is then projected down to the same dimensionality as the LSTM representations, also denoted as H^stmt and H^sub.
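The bi-directional encoding above can be sketched as follows; a plain tanh RNN stands in for the LSTM cells purely for illustration, since the key point is stacking hidden states from both directions:

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_rnn(x, W, U, b):
    """Run a plain tanh RNN over x of shape (T, d_in); returns (T, d_h).
    A simplified stand-in for one direction of the paper's LSTM."""
    h = np.zeros(b.shape[0])
    out = []
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U + b)
        out.append(h)
    return np.stack(out)

def bidirectional_encode(x, fwd_params, bwd_params):
    # Encode forward and backward, then stack hidden states from both
    # directions, as the paper does with its bi-directional encoders.
    fwd = simple_rnn(x, *fwd_params)
    bwd = simple_rnn(x[::-1], *bwd_params)[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # shape (T, 2 * d_h)

T, d_in, d_h = 30, 16, 8
make_params = lambda: (rng.normal(size=(d_in, d_h)),
                       rng.normal(size=(d_h, d_h)),
                       np.zeros(d_h))
H_v = bidirectional_encode(rng.normal(size=(T, d_in)),
                           make_params(), make_params())
```

The same encoder shape applies to statements and subtitles, with word embeddings (or projected BERT outputs) in place of visual features.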
4.2 Combining Multimodality Streams
The model takes three streams of information as input: video, subtitles, and statement. The goal is to determine whether the statement entails or contradicts the video and subtitles. In our model, statement representations are jointly modeled with video and subtitles via a shared fusion module. The fusion module is implemented with bidirectional attention, adopted from [50, 67, 35], where it is used for query-context matching. For simplicity, we only describe the process of combining the video and statement streams; subtitles and statement are fused in a similar way. Statement representations are used as context, and video representations as query. Each word in the statement thus attends to every time step in the video representations. Let a_k ∈ R^T be the attention weights for the k-th word in the statement, with Σ_t a_{k,t} = 1. The output is a video-aware statement representation S^v, whose k-th row is S^v_k = Σ_t a_{k,t} H^v_t. Similarly, we combine the subtitle and statement streams to obtain a subtitle-aware statement representation S^s. These two sets of representations are further fused via:
M = [S^v; S^s; S^v ⊙ S^s],
where ⊙ stands for element-wise product. The resulting matrix M combines information from all three modality streams, and is then fed into another bidirectional LSTM. The last hidden states from both directions are concatenated and passed through a fully-connected layer with 1-dimensional output, followed by a sigmoid activation function, predicting the probability of the input statement being positive.
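A minimal numpy sketch of this bidirectional-attention fusion, with the statement as context and video/subtitles as queries. The exact fusion form used here (concatenating the two statement-aware representations with their element-wise product) is one common choice and an assumption of this sketch:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(statement, context):
    """statement: (n, d) statement reps; context: (T, d) video or subtitle
    reps. Each statement word attends over all context time steps and
    returns a context-aware statement representation of shape (n, d)."""
    scores = statement @ context.T       # (n, T) similarity scores
    alpha = softmax(scores, axis=1)      # attention weights a_{k,t}
    return alpha @ context               # weighted sum over context steps

n, T, m, d = 12, 30, 50, 8               # toy sequence lengths and dim
H_stmt, H_vid, H_sub = (np.random.randn(k, d) for k in (n, T, m))

S_v = attend(H_stmt, H_vid)   # video-aware statement representation
S_s = attend(H_stmt, H_sub)   # subtitle-aware statement representation
# Fuse the two streams; the concatenation-with-product form is our
# illustrative assumption.
M = np.concatenate([S_v, S_s, S_v * S_s], axis=1)  # (n, 3 * d)
```

In the full model, M is consumed by a final bi-LSTM whose output feeds a one-unit sigmoid classifier.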
The proposed baseline model is similar to the one used in TVQA. The main difference is that our model uses statement representations as context and video/subtitle representations as query in the fusion module. The intuition is that, in our video-and-language inference task, the full statement needs to be supported by evidence from either the video or the subtitles in order to judge the statement as positive/negative, rather than merely locating the position in the video/subtitles that is most relevant to the query (as in TVQA). Thus, in our model, every word in the statement attends to the video and subtitles in the fusion module, and the results are then combined and fed to the final bi-LSTM to make the prediction.
5 Experiments
For evaluation, we compare our model with several baselines on the dataset and provide a detailed analysis of the results. In all experiments, we split the Violin dataset into 80% for training (76,122 triplets), 10% for validation (9,600 triplets), and 10% for testing (9,600 triplets). Model performance is evaluated via binary classification accuracy.
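The split sizes are consistent with the 95,322 total triplets and the stated 80/10/10 proportions:

```python
n_train, n_val, n_test = 76_122, 9_600, 9_600
total = n_train + n_val + n_test
assert total == 95_322          # matches the reported dataset size
train_frac = n_train / total    # ~0.80
val_frac = n_val / total        # ~0.10
```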
5.1 Compared Models
First, we define the following combinations of input sources, to evaluate the importance of different modality streams:
Statements Only: Using statements only, without absorbing information from video or subtitles. This option is to test the innate bias of positive/negative statements.
Video: Using video features only.
Subtitles: Using subtitles only.
Video+Subtitles: Using both video and subtitle features, which is the full setting for the task.
Single Frame+Subtitles: Using subtitle features plus only one middle frame from the video. This option is to test the usefulness of temporal information in the video.
Different visual features are also evaluated on the Violin task: (1) Image feature: we use ResNet101 trained on ImageNet to extract a global image feature for each frame; (2) C3D feature: we use a 3-dimensional convolutional neural network (C3D) to extract video features; (3) Detection feature: we run Faster R-CNN trained on Visual Genome to detect objects in each frame and use their regional features as the input. For image features, we first down-sample each video to 3 frames per second, and then extract the 2048-dim feature for each frame. Similarly, for detection features, we use the same sampling rate and extract features, followed by a pooling layer outputting a 2048-dim feature for each frame. For C3D features, we extract 4096-dim features for every 16 frames of the original video (without down-sampling). To encode text input as features, we use (1) a pre-trained BERT-base model finetuned on Violin statements and subtitles in the training set, and (2) GloVe embeddings. For thorough evaluation, we also test a large-scale pre-trained model, LXMERT, that jointly learns multimodal features.
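Under the stated sampling rates, the feature tensors for an average-length clip have roughly the following shapes. The 24 fps native frame rate assumed for the C3D computation is our assumption; the paper does not state the source frame rate:

```python
import math

clip_len_s = 35.2   # average clip length reported for Violin
fps = 3             # down-sampling rate for image/detection features

n_frames = math.floor(clip_len_s * fps)   # ~105 sampled frames per clip
image_feats_shape = (n_frames, 2048)      # per-frame ResNet101 features

# C3D features cover every 16 frames of the ORIGINAL (non-down-sampled)
# video; assuming a 24 fps source, an average clip yields roughly:
n_c3d_windows = math.floor(clip_len_s * 24 / 16)
c3d_feats_shape = (n_c3d_windows, 4096)
```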
5.2 Experimental Results
Table 3 summarizes results from baseline methods and our proposed model (using full-length video clips, subtitles and statements). We also run a set of experiments with different visual/text features and compare the results in Table 3.
Baseline Comparison Row 0 is the random-guess baseline with an accuracy of 50%. When using only the statement to decide whether it is positive or negative, the best model with BERT features achieves only 54.20, indicating little bias in the dataset. By adding subtitles or video, all the models obtain significant gains over the "statement only" versions. Notably, Stmt+Subtt with BERT and Stmt+Vis with Det+BERT achieve 66.05 (row 4) and 59.45 (row 10), respectively. From rows 3-4 and 12-17, we observe that adding subtitles improves performance significantly. However, the gain from adding video (rows 5-10 and 12-17) is not as large as that from adding subtitles. This might be due to the visual features not capturing video information well. Using only one frame as the video feature (row 11) is worse than using the full video (row 13), showing the importance of exploiting temporal information in the video. Overall, the best performance is achieved by using all the sources, with BERT and Detection features (row 17).
Model Variants We first evaluate the effectiveness of different visual features. In most settings, Detection features work better than Image and C3D features, indicating that the extracted regional information and external knowledge from Visual Genome are useful for this task. Among the textual features, BERT is the strongest, as expected. Across settings, BERT-based versions generally improve accuracy by 3% to 6% compared with non-contextualized embeddings such as GloVe. Joint multimodal embedding (LXMERT, row 18) achieves 66.25, slightly worse than the best baseline model (row 17), showing that Violin imposes new challenges on existing single-image-based joint pre-trained models.
Human Evaluation Human performance via AMT is presented in Table 4. As expected, humans achieve the best performance when provided with both video and subtitles (85.20). (In a repeated human evaluation conducted by ourselves, the accuracy is 93%.) Without context (video and subtitles), humans only achieve 51.38% accuracy. Interestingly, we find that adding video brings more gain than adding subtitles, showing the importance of visual information in the Violin task.
| Source | Test Accuracy (%) |
| --- | --- |
| Subtitle + Statement | 73.85 |
| Video + Statement | 77.19 |
5.3 Further Analysis
Accuracy on Different Statement Types To better understand the dataset, we examine the accuracy of models on different statement types on the test set in Table 6. Compared to Stmt+Subtt, Stmt+Subtt+Vis models improve mostly on "visual recognition" and "action recognition". For categories such as "inferring reasons" and "identifying character", including video brings some improvement. On "conversation reasoning" and "human dynamics", adding video features does not help.
Human-Written vs. Adversarially-Sampled Negatives For comparison, we create a new statement set by replacing the adversarially-sampled negative statements with the original human-written negative statements. Results are presented in Table 5. Performance on the sampled negatives is higher than on the human-written ones. Our interpretation is that human-written content requires more intent understanding and in-depth reasoning, which makes those statements more challenging for the model.
Qualitative Analysis Figure 5 presents some prediction examples from our model using statement, video and subtitles. The correct cases in Figure 5 (a) demonstrate the model’s ability to recognize action, infer emotion, identify referred person, and understand temporal dynamics in the video. In (b), the error cases show that our model does not work well on inferring reasons and human relations.
We introduce a new task, video-and-language inference (Violin), which requires intelligent systems to capture rich temporal signals about activities/events in video and text, in order to acquire reasoning skills for multimodal inference. We provide thorough baseline experiments for benchmarking different models on the large-scale dataset, as well as a comprehensive analysis of the dataset. The gap between the baseline models and human performance is significant. We encourage the community to participate in this task and invent stronger methods to push the state of the art on multimodal inference. Possible future directions include developing models to localize key frames, as well as better utilizing the alignment between video and subtitles to improve reasoning ability.
Acknowledgement We would like to thank Yandong Li, Liqun Chen, Shuyang Dai, Linjie Li, Chen Zhu, Jiacheng Xu and Boyi Li for providing useful feedback on the project and their help in collecting and annotating data. We thank all the reviewers for their helpful comments. The first author is supported in part by NSF under grant IIS-1546329.
References
- (2016) YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
- (2016) Neural module networks. In CVPR.
- (2017) Localizing moments in video with natural language. In ICCV.
- (2015) VQA: visual question answering. In ICCV.
- (2015) A large annotated corpus for learning natural language inference. In EMNLP.
- (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In CVPR.
- (2019) MUREL: multimodal relational reasoning for visual question answering. In CVPR.
- (2018) Temporally grounding natural sentence in video. In EMNLP.
- (2019) Meta module network for compositional visual reasoning. arXiv preprint arXiv:1910.03230.
- (2019) UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740.
- (2014) Temporal sequence modeling for video event detection. In CVPR.
- (2017) Visual dialog. In CVPR.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
- (2017) StyleNet: generating attractive visual captions with styles. In CVPR.
- (2019) Multi-step reasoning via recurrent dual attention for visual dialog. arXiv preprint arXiv:1902.00579.
- (2017) Semantic compositional networks for visual captioning. In CVPR.
- (2017) TALL: temporal activity localization via language query. In ICCV.
- (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In CVPR.
- (2013) YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV.
- (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- (2017) Learning to reason: end-to-end module networks for visual question answering. In ICCV.
- (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067.
- (2019) GQA: a new dataset for compositional question answering over real-world images. In CVPR.
- (2017) TGIF-QA: toward spatio-temporal reasoning in visual question answering. In CVPR.
- (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
- (2017) Inferring and executing programs for visual reasoning. In ICCV.
- (2017) The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
- (2018) Bilinear attention networks. In NeurIPS.
- (2016) Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325.
- (2017) DeepStory: video story QA by deep embedded memory networks. In IJCAI.
- (2017) Dense-captioning events in videos. In ICCV.
- (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV.
- (2018) TVQA: localized, compositional video question answering. In EMNLP.
- (2019) TVQA+: spatio-temporal grounding for video question answering. arXiv preprint arXiv:1904.11574.
- (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. arXiv preprint arXiv:2001.09099.
- (2019) Relation-aware graph attention network for visual question answering. arXiv preprint arXiv:1903.12314.
- (2014) Microsoft COCO: common objects in context. In ECCV.
- (2019) Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
- (2016) Hierarchical question-image co-attention for visual question answering. In NeurIPS.
- (2017) A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR.
- (2018) Learning conditioned graph structures for interpretable visual question answering. In NeurIPS.
- (2014) GloVe: global vectors for word representation. In EMNLP.
- (2018) Adaptive feature abstraction for translating video to text. In AAAI.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §5.1.
-  (2017) A simple neural network module for relational reasoning. In NeurIPS, Cited by: §2.
-  (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603. Cited by: §4.2.
-  (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §1.
-  (2019) A corpus for reasoning about natural language grounded in photographs. In ACL, Cited by: §1, §2, Table 2.
-  (2019) Lxmert: learning cross-modality encoder representations from transformers. In EMNLP, Cited by: §1, §5.1.
-  (2016) Movieqa: understanding stories in movies through question-answering. In CVPR, Cited by: §1, §2, Table 2.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, Cited by: §5.1.
-  (2015) Sequence to sequence-video to text. In ICCV, Cited by: §2.
-  (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §2.
-  (2016) Walk and learn: facial attribute representation learning from egocentric video and contextual data. In CVPR, Cited by: §2.
-  (2019) VATEX: a large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, Cited by: §1, §2.
-  (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, Cited by: §1, §2.
-  (2019) Visual entailment: a novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706. Cited by: §1, §2, Table 2.
-  (2016) Msr-vtt: a large video description dataset for bridging video and language. In CVPR, Cited by: §2.
-  (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.
-  (2016) Stacked attention networks for image question answering. In CVPR, Cited by: §2.
-  (2020) Clevrer: collision events for video representation and reasoning. In ICLR, Cited by: §2.
-  (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL. Cited by: §2, §2.
-  (2018) Qanet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541. Cited by: §4.2.
-  (2018) Mattnet: modular attention network for referring expression comprehension. In CVPR, Cited by: §1.
-  (2016) Modeling context in referring expressions. In ECCV, Cited by: §1.
-  (2019) Deep modular co-attention networks for visual question answering. In CVPR, Cited by: §2.
-  (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV, Cited by: §2.
-  (2019) From recognition to cognition: visual commonsense reasoning. In CVPR, Cited by: §1, §1, §2, §3.1, Table 2.
-  (2019) Freelb: enhanced adversarial training for language understanding. arXiv preprint arXiv:1909.11764. Cited by: §2.
Appendix A Additional Data Analysis
A.1 Statement Length Distribution
A.2 Statement Content
Table 7 lists the most common nouns, verbs, and adjectives in positive statements.
A.3 Video Length Distribution
Video clips collected from MovieClips are all 40 seconds long. Clips collected from TV shows vary in length from 15 to 40 seconds, as shown in Figure 8.
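Length statistics of this kind can be tallied with a short script. The sketch below is a minimal illustration, assuming a hypothetical list of clip records with `source` and `duration` fields; it is not the pipeline used for the figures above.

```python
from collections import Counter

# Hypothetical clip metadata: each record carries its source
# ("movieclips" or "tv") and its duration in seconds.
clips = [
    {"source": "movieclips", "duration": 40.0},
    {"source": "tv", "duration": 22.5},
    {"source": "tv", "duration": 37.0},
    {"source": "tv", "duration": 15.0},
]

def length_histogram(records, bin_size=5):
    """Bucket clip durations into (lo, hi) bins of `bin_size` seconds."""
    hist = Counter()
    for clip in records:
        lo = int(clip["duration"] // bin_size) * bin_size
        hist[(lo, lo + bin_size)] += 1
    return dict(sorted(hist.items()))

print(length_histogram(clips))
# e.g. {(15, 20): 1, (20, 25): 1, (35, 40): 1, (40, 45): 1}
```

Filtering `records` by `source` before binning gives the per-source distributions plotted in the appendix figures.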