
Video Graph Transformer for Video Question Answering

This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and QA solution, we show that VGT achieves much better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning in the pretraining-free scenario. Its performance even surpasses that of models pretrained on millions of external data. We further show that VGT also benefits substantially from self-supervised cross-modal pretraining, yet with orders of magnitude less data. These results demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at


1 Introduction

Since the 1960s, the very beginning of Artificial Intelligence (AI), long efforts and steady progress have been made towards machine systems that can demonstrate their understanding of the dynamic visual world by responding to humans' natural language queries in the context of videos, which directly reflect our physical surroundings. In particular, since 2019, we have been witnessing a drastic advancement in such multi-disciplinary AI, where computer vision, natural language processing, and knowledge reasoning are coordinated for accurate decision making. This advancement stems in part from the success of multi-modal pretraining on web-scale vision-text data [8, 21, 31, 34, 38, 44, 52, 53, 54, 63], and in part from a unified deep neural network that can model both vision and natural language data well, i.e., the Transformer [55]. As a typical multi-disciplinary AI task, Video Question Answering (VideoQA) has benefited a lot from these developments, which help to propel the field steadily forward over the use of purely conventional techniques [14, 16, 20, 23, 28, 60, 71].

Despite the excitement, we find that the advances made by such transformer-style models mostly lie in answering questions that demand the holistic recognition or description of video contents [30, 48, 62, 63, 64, 68, 72]. The problem of answering questions that challenge real-world visual relation reasoning, especially the causal and temporal relations that feature video dynamics [20, 59], is largely under-explored. Cross-modal pretraining seems promising [29, 67, 70]. Yet, it requires the handling of prohibitively large-scale video-text data [15, 70], or otherwise the performances are still inferior to the state-of-the-art (SoTA) conventional techniques [29, 47, 67]. In this work, we empirically reveal two major reasons that account for the failure: 1) Video encoders are overly simplistic. Current video encoders are either 2D neural networks (CNNs [18, 45] or Transformers [13]) operated over sparse frames or 3D neural networks [5, 37, 61] operated over short video segments. Such networks encode the videos holistically, but fail to explicitly model the fine-grained details, i.e., spatio-temporal interactions between visual objects. Consequently, the resulting VideoQA models are weak in reasoning and require large-scale video data for learning to compensate for such weak forms of input. 2) The formulation of the VideoQA problem is sub-optimal. Often, in multi-choice QA, the video, question, and each candidate answer are appended (or fused) into one holistic token sequence and fed to a cross-modal Transformer to gain a global representation for answer classification [72, 29]. Such a global representation is weak in disambiguating the candidate answers, because the video and question portions are the same and large, which may overwhelm the short answer and dominate the overall representation. In open-ended QA (popularly formulated as a multi-class classification problem [62]), answers are treated as class indexes and their word semantics (which are helpful for QA) are ignored. The insufficient information modelling exacerbates the data-hungry issue and leads to sub-optimal performance as well.

To improve visual relation reasoning and also reduce the data demands for video question answering, we propose the Video Graph Transformer (VGT) model. VGT addresses the aforementioned problems and advances over previous transformer-style VideoQA models mainly in two aspects: 1) For the video encoder, it designs a dynamic graph transformer module which explicitly captures the objects and relations as well as their dynamics to improve visual reasoning in dynamic scenarios. 2) For problem formulation, it exploits separate vision and text transformers to encode video and text respectively for similarity (or relevance) comparison, instead of using a single cross-modal transformer to fuse the vision and text information for answer classification. Vision-text communication is done by additional cross-modal interaction modules. Through more sufficient video information modelling and a more reasonable QA problem solution, we show that VGT achieves much better performance on benchmarks featuring dynamic relation reasoning than previous arts, including those pretrained on million-scale vision-text data. Such strong performance comes even without using external data for pretraining. When pretraining VGT with a small amount of data, we observe further non-trivial performance improvements. The results clearly demonstrate VGT's effectiveness and superiority in visual reasoning, as well as its potential for more data-efficient video-language pretraining (i.e., the model demands less training data to achieve good performance).

To summarize our contributions: 1) We propose the Video Graph Transformer (VGT), which advances VideoQA from shallow description to in-depth reasoning. 2) We design a dynamic graph transformer module which shows strength for visual reasoning. The module is task-agnostic and can be easily applied to other video-language tasks. 3) We achieve SoTA results on NExT-QA [59] and TGIF-QA [20], which task visual reasoning over dynamic visual contents. Also, our structured video representation shows promise for data-efficient video-language pretraining.

2 Related Work

Conventional Techniques for VideoQA. Prior to the success of Transformer for vision-language tasks, various techniques, e.g., cross-modal attention [20, 33, 22], motion-appearance memory [16, 14, 36], and graph neural networks [23, 35, 41], have been proposed to model informative video contents for answering questions. Yet, most of them leverage frame- or clip-level video representations as the information source. Recently, graphs constructed over object-level representations [19, 36, 47, 60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20, 49, 50, 59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations in 1) space and time, 2) local and global scopes [19, 57], or build static graphs at frame-level without explicitly capturing the temporal dynamics [36, 42, 60]. The monolithic graph is cumbersome for long videos where multiple objects interact in space-time. Besides, the static graphs may lead to incorrect relations (e.g., hug vs. fight) or fail to capture dynamic relations (e.g., take away). In this work, we model video as a local-to-global dynamic visual graph, and design a graph transformer module to explicitly model the objects, their relations, and dynamics, exploiting objects and relations in adjacent frames to calibrate the spurious relations obtained at the static frame level. Importantly, we also integrate strong language models and explore cross-modal pretraining techniques to learn the structured video representations in a self-supervised manner.

Transformer for VideoQA. Pioneering works [32, 48, 63, 64, 72] learn generalizable representations from HowTo100M [40] by either applying various proxy tasks [72], or curating more tailor-made supervisions (e.g., future utterance [48] and QA pairs [64]) for VideoQA. However, they focus on answering questions that demand holistic recognition [62] or shallow description [68], and their performance on visual relation reasoning [20, 59] remains unknown. Furthermore, recent works [3, 70] reveal that these models may suffer from performance loss on open-domain questions due to the heavy noise [1, 39] and limited data scope of HowTo100M. Recent efforts tend to use open-domain vision-text data for end-to-end learning. ClipBERT [29] takes advantage of image-caption data [7, 27] for pretraining, but it brings only limited performance improvement on temporal reasoning tasks [20], as temporal relations are hard to learn from static images. In addition, ClipBERT relies on human-annotated descriptions, which are expensive to annotate and hard to scale up. More recent works [15, 70] collect million-scale user-generated (vastly abundant on the Web) vision-text data [3, 51, 70] for pretraining, but suffer from the huge computational cost of training on such large-scale datasets. Two latest works [6, 12] reveal the potential of Transformers for learning on the target datasets (relatively small scale). While promising, they either aim at revealing the single-frame bias of benchmark datasets by using image-text pretrained features (e.g., from CLIP [44]), or only demonstrate the model's effectiveness on synthesized data [65]. Overall, the poor-dynamic-reasoning and data-hungry problems in existing transformer-style video-language models largely motivate this work. To alleviate these problems, we explicitly model the objects and relations for dynamic visual reasoning and incorporate structure priors (or relational inductive bias [4]) into transformer architectures to reduce the demand on data.

Graph Transformer. The connection between graph neural networks and Transformer has earned increasing attention [56, 66, 69]. Nonetheless, the major advancements are made in modelling natural graph data (e.g., social connections) by either incorporating graph expertise (e.g., node degrees) into the self-attention block of Transformer [66], or designing transformer-style convolution blocks to fuse information from heterogeneous graphs [69]. A recent work [17] combines graphs and Transformers for video dialogues. Yet, it simply applies a global transformer over pooled graph representations built from static frames and does not explicitly encode object and relation dynamics. Our work differs from it by designing and learning dynamic visual graphs over video objects and using transformers to capture the temporal dynamics at both local and global scopes.

3 Method

3.1 Overview

Given a video $v$ and a question $q$, VideoQA aims to combine the two streams of information $v$ and $q$ to predict the answer $a$. Depending on the task setting, $a$ can be given as one of multiple choices along with each question for multi-choice QA, or drawn from a global answer set for open-ended QA. In this work, we handle both types of VideoQA by optimizing the following objective:

$$a^* = \arg\max_{a \in \mathcal{A}} \mathcal{F}_W(a \mid q, v, \mathcal{A}), \tag{1}$$

in which $\mathcal{A}$ corresponds to the candidate answers of each question in multi-choice QA, or to the global answer set in open-ended QA. $\mathcal{F}_W$ denotes the mapping function with learnable parameters $W$.

To solve the problem, we design a video graph transformer (VGT) model to perform the mapping in Eqn. (1). As illustrated in Fig. 1, at the visual part (Orange), VGT takes as input visual object graphs, and derives a global feature $f^{qv}$ with the integration of textual information, to represent the query-relevant video content. At the textual part (Blue), VGT extracts the feature representations $F^A$ for all the candidate answers via a language model (e.g., BERT [11]). The final answer is determined by returning the candidate answer with maximal similarity (relevance score) between $f^{qv}$ and $F^A$ via dot-product. At the heart of the model is the dynamic graph transformer module (DGT). The module reasons over the input graphs clip by clip, and aggregates them into a sequence of feature representations which are then fed to a global transformer to obtain $f^{qv}$. During training, the whole framework is end-to-end optimized with Softmax cross-entropy loss. For pretraining with weakly-paired video-text data, we adopt cross-modal matching as the major proxy task and optimize the model in a contrastive manner [44] along with masked language modelling [11].

Figure 1: Overview of video graph transformer (VGT) for VideoQA.

Figure 2: Illustration of graph construction in a short video clip of frames. The nodes of same color denote same object.

3.2 Video Graph Representation

Given a video, we sparsely sample $l_v$ frames in a way analogous to [60]. The $l_v$ frames are evenly distributed into $k$ clips of length $l_c = l_v / k$. For each sampled frame (see Fig. 2), we extract $n$ RoI-aligned features as object appearance representations $F_r = \{f_{r_i}\}_{i=1}^{n}$ along with their spatial locations $B = \{b_{r_i}\}_{i=1}^{n}$ with a pretrained object detector [2, 45], where $r_i$ represents the $i$-th object region in a frame. Additionally, we obtain an image-level feature $F_I = \{f_{I_t}\}_{t=1}^{l_v}$ for all the sampled frames with a pretrained image classification model [18]. $F_I$ serves as global context to augment the graph representations aggregated from the local objects.

To find the same object across different frames within a clip, we define a linking score by considering their appearance and spatial location:

$$\text{score}_{i,j} = \psi(f_{r_i}^{t}, f_{r_j}^{t+1}) + \lambda \cdot \text{IoU}(b_{r_i}^{t}, b_{r_j}^{t+1}), \quad t \in \{1, 2, \ldots, l_c - 1\}, \tag{2}$$

where $\psi$ denotes the cosine similarity between two detected objects $r_i^{t}$ and $r_j^{t+1}$ in adjacent frames, and $l_c$ is the number of frames in a clip. Intersection-over-union (IoU) computes the location overlap of objects $r_i^{t}$ and $r_j^{t+1}$. Our experiments always set $\lambda$ to one. The detected objects in the first frame of each clip are designated as anchor objects. Detected objects in consecutive frames are then linked to the anchor objects by greedily maximizing the score frame by frame (we assume that the group of objects does not change in a short video clip). By aligning objects within a clip, we ensure the consistency of the node and edge representations for the graphs constructed at different frames.
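The linking step can be illustrated with a small, self-contained sketch. It assumes boxes in `(x1, y1, x2, y2)` format and a one-to-one greedy assignment between anchors and detections; the function names (`link_score`, `link_to_anchors`) and the tie-breaking order are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    ih = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_score(obj_a, obj_b, lam=1.0):
    """Appearance similarity plus lam times location overlap."""
    (feat_a, box_a), (feat_b, box_b) = obj_a, obj_b
    return cosine(feat_a, feat_b) + lam * iou(box_a, box_b)

def link_to_anchors(anchors, detections, lam=1.0):
    """Greedily match each anchor to one detection, highest score first.

    anchors, detections: lists of (feature, box) pairs.
    Returns, per anchor, the index of its matched detection (or None).
    """
    candidates = sorted(
        ((link_score(a, d, lam), i, j)
         for i, a in enumerate(anchors)
         for j, d in enumerate(detections)),
        reverse=True,
    )
    match, used = {}, set()
    for _, i, j in candidates:
        if i not in match and j not in used:
            match[i] = j
            used.add(j)
    return [match.get(i) for i in range(len(anchors))]
```

Applying `link_to_anchors` to each later frame of a clip in turn yields, for every anchor object, a track of detections across the clip.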

Next, we concatenate the object appearance and location representations and project the combined feature into the $d$-dimensional space via

$$f_o = \text{ELU}(\phi_{W_o}([f_r; f_{loc}])), \tag{3}$$

where $[;]$ denotes feature concatenation and $f_{loc}$ is obtained by applying a convolution over the relative coordinates as in [60]. The function $\phi_{W_o}$ denotes a linear transformation with parameters $W_o$. With $F_o = \{f_{o_i}\}_{i=1}^{n}$, the relations in the $t$-th frame can be initialized as pairwise similarities:

$$R_t = \sigma(\phi_{W_{ak}}(F_{o_t}) \, \phi_{W_{av}}(F_{o_t})^{\top}), \tag{4}$$

where $\phi_{W_{ak}}$ and $\phi_{W_{av}}$ denote linear transformations with parameters $W_{ak}$ and $W_{av}$ respectively. We use different transformations to reflect the asymmetric nature of real-world subject-object interactions [26, 58]. For symmetric relations, we expect that the learned parameters $W_{ak}$ and $W_{av}$ are quite similar. $\sigma$ is the Softmax operation that normalizes each row. For brevity, we use $G_t = (F_{o_t}, R_t)$ to denote the graph representation of the $t$-th frame, where $F_{o_t}$ are node representations and $R_t$ are edge representations of the graph.
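As a rough illustration of the relation initialization, the sketch below computes a row-normalized similarity matrix from node features using two different linear maps. The weight matrices `w_ak` and `w_av` stand in for learned parameters and the toy values are made up; this is a minimal sketch, not the authors' code.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Softmax applied independently to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def relation_matrix(nodes, w_ak, w_av):
    """Row-normalized pairwise similarities between projected node features.

    nodes: n x d node features of one frame.
    w_ak, w_av: two different d x d' projections, so R[i][j] need not equal R[j][i].
    """
    left = matmul(nodes, w_ak)
    right = matmul(nodes, w_av)
    right_t = [list(col) for col in zip(*right)]
    return softmax_rows(matmul(left, right_t))
```

Because the two projections differ, the resulting matrix is generally asymmetric, which matches the subject-object reading of a relation given in the text.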

3.3 Dynamic Graph Transformer

Our dynamic graph transformer (DGT) takes as input a set of visual graphs clip by clip, and outputs a sequence of representations by mining the temporal dynamics of objects and their spatial interactions. To this end, we sequentially operate a temporal graph transformer unit, a spatial graph convolution unit, and a hierarchical aggregation unit, as detailed below.

3.3.1 Temporal Graph Transformer

As illustrated in Fig. 3, the temporal graph transformer unit takes as input a set of graphs and outputs a new set of graphs by mining the temporal dynamics among them via a node transformer (NTrans) and an edge transformer (ETrans). For completeness, we briefly recap the self-attention in Transformer [55]. It uses multi-head self-attention (MHSA) to fuse a sequence of input features $X^{(i)}$:

$$X^{(o)} = \text{MHSA}(X^{(i)}) = \phi_{W_c}([h_1; h_2; \ldots; h_e]), \tag{5}$$

where $\phi_{W_c}$ is a linear transformation with parameters $W_c$, and

$$h_j = \text{SA}(\phi_{W_{jq}}(X^{(i)}), \phi_{W_{jk}}(X^{(i)}), \phi_{W_{jv}}(X^{(i)})), \tag{6}$$

where $\phi_{W_{jq}}$, $\phi_{W_{jk}}$, and $\phi_{W_{jv}}$ denote the linear transformations of the query, key, and value vectors of the $j$-th self-attention (SA) head respectively. $e$ denotes the number of self-attention heads, and SA is defined as:

$$\text{SA}(Q, K, V) = \sigma\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \tag{7}$$

in which $d_k$ is the dimension of the key vector. Finally, a skip-connection with layer normalization (LN) is applied to the output sequence: $X = \text{LN}(X^{(o)} + X^{(i)})$. $X$ can undergo more MHSAs depending on the number of transformer layers.
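The recap above corresponds to the standard Transformer attention block. A minimal pure-Python sketch follows; the per-head weight triples, the output projection `w_c`, and the parameter-free layer normalization are simplifying assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Softmax applied independently to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = [[sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    return matmul(softmax_rows(scores), v)

def layer_norm(row, eps=1e-5):
    """Layer normalization without learnable scale and shift (a simplification)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def mhsa_block(x, heads, w_c):
    """One multi-head self-attention layer with skip connection and LN.

    x: sequence of feature vectors (T x d).
    heads: list of (w_q, w_k, w_v) projection matrices, one triple per head.
    w_c: output projection applied to the concatenated head outputs.
    """
    head_outs = [self_attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
                 for wq, wk, wv in heads]
    concat = [[val for h in head_outs for val in h[t]] for t in range(len(x))]
    fused = matmul(concat, w_c)
    return [layer_norm([a + b for a, b in zip(xr, fr)]) for xr, fr in zip(x, fused)]
```

Stacking several calls to `mhsa_block` gives a multi-layer variant; in the node transformer the input sequence would be one object's features across the frames of a clip.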

In the temporal graph transformer, we apply $H$ self-attention blocks to enhance the node (or object) representations by aggregating information from other nodes of the same object from all adjacent frames within a clip:

$$F'_{o_i} = \text{NTrans}(F_{o_i}) = \text{MHSA}^{(H)}(F_{o_i}), \tag{8}$$

in which $F_{o_i} = \{f_{o_i}^{t}\}_{t=1}^{l_c}$ denotes a sequence of feature representations corresponding to object $i$ in a video clip of length $l_c$. Our motivation behind the node transformer is that it models the change of a single object's behaviour and can thus infer dynamic actions (e.g., bend down). It also helps to improve the objects' appearance features in cases where the object suffers from motion blur or partial occlusion in certain frames.

Figure 3: Illustration of temporal graph transformer in a short video clip.

Based on the new nodes $F'_{o}$, we update the relation matrix $R$ via Eqn. (4). Then, to explicitly model the temporal relation dynamics, we apply an edge transformer on the updated relation matrices:

$$R' = \text{ETrans}(R) = \text{MHSA}^{(H)}(R), \tag{9}$$

where $R = \{R_t\}_{t=1}^{l_c}$ are the adjacency matrices, each expanded row-wise into a vector. Our motivation is that the relations captured at static frames may be spurious, trivial, or incomplete. The edge transformer can help to calibrate the wrong relations and recall the missing ones. For brevity, we refer to the temporally contextualized graph at the $t$-th frame as $G'_t = (F'_{o_t}, R'_t)$.

3.3.2 Spatial Graph Convolution

The temporal graph transformer focuses on temporal relation reasoning. To reason over the object spatial interactions, we apply a $U$-layer graph attention convolution [25] on all the graphs:

$$X^{(u)} = \text{ReLU}((R' + I) X^{(u-1)} W^{(u)}), \tag{10}$$

where $W^{(u)}$ denotes the graph parameters at the $u$-th layer, and $I$ is the identity matrix for skip connections. $X^{(0)}$ is initialized by the output node representations $F'_{o}$ as aforementioned. The frame index $t$ is omitted for brevity. A last skip-connection, $F_o^{out} = F'_{o} + X^{(U)}$, is used to obtain the final node representations.
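A compact sketch of this spatial reasoning step is given below: each layer propagates node features along the relation matrix plus an identity skip, and a final skip connection adds the input nodes back. The ReLU nonlinearity and the toy weights are choices of this sketch, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def graph_conv(x, rel, weight):
    """One layer: ReLU((R + I) X W), with I acting as a per-node skip connection."""
    n = len(x)
    adj = [[rel[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    out = matmul(matmul(adj, x), weight)
    return [[max(0.0, v) for v in row] for row in out]

def spatial_reasoning(nodes, rel, weights):
    """Apply len(weights) graph layers, then add the input nodes back."""
    x = nodes
    for w in weights:
        x = graph_conv(x, rel, w)
    return [[a + b for a, b in zip(n0, xr)] for n0, xr in zip(nodes, x)]
```

With a uniform relation matrix every node receives an equal share of its neighbours' features, while the identity term keeps each node's own feature dominant.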

3.3.3 Hierarchical Aggregation

The node representations so far have explicitly taken into account the objects' spatial and temporal interactions. But such interactions are mostly atomic. To aggregate these atomic interactions into higher-level video elements, we adopt a hierarchical aggregation strategy as shown in Fig. 4.

Figure 4: Hierarchical Aggregation.

First, we aggregate the graph nodes at each frame by a simple attention:

$$f_G = \sum_{i=1}^{n} \alpha_i f_{o_i}^{out}, \quad \alpha = \sigma(\phi_{W_G}(F_o^{out})), \tag{11}$$

where $\phi_{W_G}$ is a linear transformation with parameters $W_G$. The graph representation $f_G$ captures local object interactions. It may lose sight of the global picture of a frame, especially since we only retain $n$ objects and cannot guarantee that they include all the objects of interest in that frame. As such, we complement $f_G$ with the frame-level feature $f_I$ by concatenation:

$$f'_G = \text{ELU}(\phi_{W_m}([f_G; \phi_{W_f}(f_I)])), \tag{12}$$

in which $\phi_{W_m}$ and $\phi_{W_f}$ are linear transformations with parameters $W_m$ and $W_f$ respectively. We next pool the local interactions to obtain a sequence of clip-level feature representations via:

$$f^c = \text{MPool}(f'_{G_1}, f'_{G_2}, \ldots, f'_{G_{l_c}}). \tag{13}$$

The set of $k$ clips is finally represented by $F^c = \{f^c_\tau\}_{\tau=1}^{k}$.
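The two pooling stages can be sketched as follows. For brevity the sketch leaves out the merge with the frame-level feature and assumes mean-pooling over the frames of a clip; `w_g` is a hypothetical scoring vector standing in for the learned attention parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(nodes, w_g):
    """Weighted sum of node features; weights come from a softmax over node scores."""
    scores = [sum(a * b for a, b in zip(node, w_g)) for node in nodes]
    alpha = softmax(scores)
    dim = len(nodes[0])
    return [sum(alpha[i] * nodes[i][j] for i in range(len(nodes))) for j in range(dim)]

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]

def clip_feature(frames, w_g):
    """Pool nodes within each frame, then pool frame features within the clip."""
    return mean_pool([attention_pool(nodes, w_g) for nodes in frames])
```

Calling `clip_feature` once per clip produces the sequence of clip-level features that the later stages consume.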

3.4 Cross-modal Interaction

To find the informative visual contents with respect to a particular text query, a cross-modal interaction between the visual and textual nodes is essential. Given a set of visual nodes denoted by $X^v$, we integrate textual information $X^q = \{x^q_m\}_{m=1}^{M}$ into each visual node $x^v \in X^v$ via a simple cross-modal attention:

$$\tilde{x}^v = x^v + \sum_{m=1}^{M} \beta_m x^q_m, \quad \beta = \sigma(\{(x^v)^{\top} x^q_m\}_{m=1}^{M}), \tag{14}$$

where $M$ is the number of tokens in the text query. In principle, $X^v$ can be visual representations from different levels of the DGT module, similar to [60]. In our experiments, we explore performing the cross-modal interaction with visual representations at the object level ($f_o$ in Eqn. (3)), frame level ($f'_G$ in Eqn. (12)), and clip level ($f^c$ in Eqn. (13)). We find that the results vary among different datasets. As a default, we perform cross-modal interaction at the output of the DGT module (i.e., $F^c$), since the number of nodes at this stage is much smaller, and the node representations have already absorbed the information from the preceding layers. For the text nodes $X^q$, we obtain them by a simple linear projection on the token outputs of a language model [11]:

$$X^q = \phi_{W_Q}(\text{BERT}(Q)), \tag{15}$$

where $\phi_{W_Q}$ is a linear transformation with parameters $W_Q$. The text query $Q$ can be questions in open-ended QA or QA pairs in multi-choice QA. Note that in multi-choice QA, we max-pool the obtained query-aware visual representations with respect to different QA pairs to find the one that is most relevant to the video.
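A minimal sketch of this kind of cross-modal attention is shown below: a visual node attends over the text tokens by dot-product and adds the attended text back as a residual. The residual form and the function name are assumptions made for illustration.

```python
import math

def cross_modal_attention(x_v, text_tokens):
    """Return x_v plus a softmax-weighted sum of text token vectors.

    x_v: one visual node (length-d vector).
    text_tokens: M token vectors, already projected to dimension d.
    """
    scores = [sum(a * b for a, b in zip(x_v, tok)) for tok in text_tokens]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]
    attended = [sum(beta[m] * text_tokens[m][j] for m in range(len(text_tokens)))
                for j in range(len(x_v))]
    return [a + b for a, b in zip(x_v, attended)]
```

Applied to every clip-level node, this yields query-aware visual features while keeping the number of visual nodes unchanged.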

3.5 Global Transformer

The aforementioned DGT module focuses on extracting informative visual clues from video clips. To capture the temporal dynamics between these clips, we employ another $H$-layer transformer over the cross-modal interacted clip features (i.e., $\tilde{F}^c$), and add learnable sinusoidal temporal position embeddings [11]. Finally, the transformer's outputs are mean-pooled to obtain the global representation $f^{qv}$ for the entire video, which is defined as follows:

$$f^{qv} = \text{MPool}(\text{MHSA}^{(H)}(\tilde{F}^c)), \tag{16}$$

where MPool denotes mean-pooling. The global transformer has two major advantages: 1) It retains the overall hierarchical structure which progressively derives the video elements at different granularity as in [60]. 2) It improves the feature compatibility of vision and text, which may benefit cross-modal comparison.

3.6 Answer Prediction

To obtain a global representation for a particular answer candidate, we mean-pool its token representations from BERT by $f^A = \text{MPool}(X^A)$, where $X^A$ denotes a candidate answer's token representations, and is obtained in a way analogous to Eqn. (15). Its similarity with the query-aware video representation $f^{qv}$ is then obtained via a dot-product. Consequently, the candidate answer of maximal similarity is returned as the final prediction:

$$s = (f^{qv})^{\top} F^A, \quad a^* = \arg\max(s), \tag{17}$$

in which $F^A = \{f^A_a\}_{a=1}^{|\mathcal{A}|}$, and $|\mathcal{A}|$ denotes the number of candidate answers. Additionally, for open-ended QA, we follow previous works [60] and enable a video-absent QA by directly computing the similarities between the question representation $f^q$ (obtained in a way similar to $f^A$) and the answer representations $F^A$. As a result, the final answer can be a joint decision:

$$s = (f^{qv})^{\top} F^A \odot (f^{q})^{\top} F^A, \tag{18}$$

in which $\odot$ is the element-wise product. During training, we maximize the (VQ, A) similarity corresponding to the correct answer of a given sample by optimizing the Softmax cross-entropy loss function:

$$\mathcal{L} = -\sum_{i=1}^{|\mathcal{A}|} y_i \log \frac{\exp(s_i)}{\sum_{j=1}^{|\mathcal{A}|} \exp(s_j)}, \tag{19}$$

where $s_i$ is the matching score of the $i$-th candidate answer for the sample, and $y_i = 1$ if $i$ is the index of the sample's ground-truth answer and 0 otherwise.
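The prediction and training rule described above can be sketched in a few lines. The optional `f_q` argument implements the joint decision for open-ended QA; all function names and the toy vectors used with them are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict(f_qv, answers, f_q=None):
    """Score candidates by dot-product similarity and return (best_index, scores).

    f_qv: query-aware video representation.
    answers: one representation per candidate answer.
    f_q: optional question representation; if given, the video-based and
         question-based scores are combined by element-wise product.
    """
    scores = [dot(f_qv, a) for a in answers]
    if f_q is not None:
        scores = [s * dot(f_q, a) for s, a in zip(scores, answers)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def softmax_cross_entropy(scores, target):
    """Negative log-probability of the target index under softmax(scores)."""
    mx = max(scores)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return log_z - scores[target]
```

Minimizing `softmax_cross_entropy` raises the similarity of the correct answer relative to the other candidates, so answer word semantics enter the decision through the answer representations rather than through class indexes.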

3.7 Pretraining with Weakly-Paired Data

For cross-modal matching, we encourage each video-text interacted representation $f^{qv}$ to be closer to the representation $f^{Q}$ of its paired description and far away from those of negative descriptions, which are randomly collected from other video-text pairs in each training iteration. This is formally achieved by maximizing the following contrastive objective:

$$\mathcal{L} = \sum_{i} \log \frac{\exp\big((f_i^{qv})^{\top} f_i^{Q}\big)}{\exp\big((f_i^{qv})^{\top} f_i^{Q}\big) + \sum_{(f^{qv}, f^{Q}) \in \mathcal{N}_i} \exp\big((f^{qv})^{\top} f^{Q}\big)}, \tag{20}$$

where $\mathcal{N}_i$ denotes the representations of all the negative video-description pairs of the $i$-th sample. The parameters to be optimized are hidden in the process of calculating $f^{qv}$ and $f^{Q}$ as introduced above. For negative sampling, we sample them from the whole training set at each iteration. For masked language modelling, we only corrupt the positive description of each video for efficiency.
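A contrastive objective of this kind can be sketched on precomputed similarity scores. The sketch assumes dot-product scores with no temperature; `contrastive_objective` and `batch_objective` are hypothetical names, and how negatives are gathered is left to the caller.

```python
import math

def contrastive_objective(pos_score, neg_scores):
    """log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) ), computed stably.

    Larger is better: the value approaches 0 as the positive pair dominates.
    """
    all_scores = [pos_score] + list(neg_scores)
    mx = max(all_scores)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in all_scores))
    return pos_score - log_z

def batch_objective(pos_scores, neg_scores_per_sample):
    """Sum of the per-sample objectives over a batch."""
    return sum(contrastive_objective(p, n)
               for p, n in zip(pos_scores, neg_scores_per_sample))
```

In practice one would minimize the negative of this value with a gradient-based optimizer.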

4 Experiment

4.1 Dataset and Configuration

We conduct experiments on benchmarks whose QAs feature temporal dynamics: 1) NExT-QA [59] is a manually annotated dataset that features causal and temporal object interaction in space-time. 2) TGIF-QA [20] features short GIFs; it asks questions about repeated action recognition, temporal state transition, and frame QA, which invokes a certain frame for the answer. For better comparison, we also experiment on MSRVTT-QA [62], which challenges holistic visual recognition or description. Other data statistics are presented in Appendix A.

We decode the video into frames following [60], and then sparsely sample frames from each video. The frames are distributed into clips whose length . For each frame, we detect and keep regions of high confidence for NExT-QA (Top-5 are used in the pretraining-free experiments, refer to our analysis in Appendix 0.C.2 ), and for the other datasets, using the object detection model provided by [2]. The dimension of the models’ hidden states is . The default number of layers and self-attention heads in transformer are and ( for edge transformer in DGT) respectively. Besides, the number of graph layers is . For training, we use Adam optimizer with initial learning rate

of a cosine annealing schedule. The batch size is set to 64, and the maximum epoch varies from 10 to 30 among different datasets. Our pretraining data (

M) are collected from WebVid [3]. More details are presented in Appendix 0.B.

4.2 State-of-the-Art Comparison

In Table 1, we compare VGT with the prior arts on NExT-QA [59]. The results show that VGT surpasses the previous SoTAs by clear margins on both the val and test sets, improving the overall accuracy by 1.6% and 1.9% respectively. VGT even outperforms the recent ATP [6], which is based on CLIP features [44] (VGT vs. ATP: 55.02% vs. 54.3%), and thus sets new SoTA results. In particular, we note that such strong results come without considering large-scale cross-modal pretraining. When pretraining VGT with a (relatively) small amount of data, we can further increase the results to 56.9% and 55.7% on the NExT-QA val and test sets respectively (refer to our analysis of Table 5 in Sec. 4.4).

| Method | CM-Pretrain | Val Acc@C | Val Acc@T | Val Acc@D | Val Acc@All | Test Acc@C | Test Acc@T | Test Acc@D | Test Acc@All |
|---|---|---|---|---|---|---|---|---|---|
| HGA [23] | - | 46.26 | 50.74 | 59.33 | 49.74 | 48.13 | 49.08 | 57.79 | 50.01 |
| IGV [35] | - | - | - | - | - | 48.56 | 51.67 | 59.64 | 51.34 |
| HQGA [60] | - | 48.48 | 51.24 | 61.65 | 51.42 | 49.04 | 52.28 | 59.43 | 51.75 |
| P3D-G [9] | - | 51.33 | 52.30 | 62.58 | 53.40 | - | - | - | - |
| VQA-T* [64] | - | 41.66 | 44.11 | 59.97 | 45.30 | 42.05 | 42.75 | 55.87 | 44.54 |
| VQA-T* [64] | How2VQA69M | 49.60 | 51.49 | 63.19 | 52.32 | 47.89 | 50.02 | 61.87 | 50.83 |
| VGT (Ours) | - | 52.28 | 55.09 | 64.09 | 55.02 | 51.62 | 51.94 | 63.65 | 53.68 |

Table 1: Results on NExT-QA [59]. (Acc@C, T, D: Accuracy for Causal, Temporal and Descriptive questions respectively. *: Results reproduced with the official code.)

Compared with VQA-T [64], which also formulates VideoQA as a problem of similarity comparison instead of classification, VGT outperforms it on almost all metrics. The strong results could be because VGT explicitly models the object interactions and dynamics for visual reasoning, instead of holistically encoding video clips with S3D [39, 61]. For a better analysis, we further replace the S3D encoder in VQA-T with our DGT module. As shown in Table 2 (S3D→DGT), our DGT encoder significantly improves VQA-T's result by 4.7%, in which most of the improvements are from answering reasoning-type questions. Aside from the DGT module, we encode the candidate answers in the context of the corresponding question with a single language model, whereas VQA-T encodes Q and A independently with two language models [46]. Our method improves answer encoding with contexts and reduces the model size (or parameters), as shown in Table 2 (VGT (DistilBERT)). Finally, VQA-T adopts a cross-modal transformer to fuse the video-question pair, whereas we design a light-weight cross-modal interaction module. The module is more parameter efficient but has little impact on the performances (CMTrans→CM in Table 2).

| Models | Size (M) | Acc@C | Acc@T | Acc@D | Acc@All |
|---|---|---|---|---|---|
| VQA-T [64] | 600 | 41.66 | 44.11 | 59.97 | 45.30 |
| S3D→DGT | 641 | 47.53 | 48.08 | 62.42 | 50.02 |
| CMTrans→CM | 573 | 42.27 | 44.29 | 58.17 | 45.40 |
| VGT (DistilBERT) | 346 | 50.71 | 51.67 | 66.41 | 53.46 |
| VGT (BERT) | 511 | 52.28 | 55.09 | 64.09 | 55.02 |

Table 2: Detailed comparison with VQA-T [64] on NExT-QA Val. CMTrans: Cross-Modal Transformer.

Compared with other graph-based methods [9, 23, 60], VGT enjoys several advantages: 1) It explicitly models the temporal dynamics of both objects and their interactions. 2) It solves VideoQA by explicit similarity comparison between the video and text instead of classification. 3) It represents both visual and textual data with Transformers, which may improve the feature compatibility and benefit cross-modal interaction and comparison [11]. 4) VGT uses far fewer frames for training and inference (e.g., VGT vs. HQGA [60]: 32 vs. 256), which benefits the efficiency of video encoding. The detailed analyses are given in Sec. 4.3.

In Table 3, we compare VGT with previous arts on the TGIF-QA and MSRVTT-QA datasets. The results show that VGT performs well on the tasks of repeating action recognition and state transition that feature temporal dynamics, surpassing the previous pretraining-free SoTA results significantly by 10.6% (VGT vs. MASN [47]: 95.0% vs. 84.4%) and 6.8% (VGT vs. MHN [43]: 97.6% vs. 90.8%) respectively. It even beats the pretraining SoTA (i.e., MERLOT [70]) by about 1.0%, yet without using external data for cross-modal pretraining. On TGIF-QA-R [42], which is curated by making the negative answers in TGIF-QA more challenging, we also observe remarkable improvements. Besides, VGT also achieves competitive results on normal descriptive QA tasks as defined in FrameQA and MSRVTT-QA, though they are not our focus.

| Models | CM-Pretrain | TGIF-QA Action | TGIF-QA Transition | TGIF-QA FrameQA | Action† | Transition† | MSRVTT-QA |
|---|---|---|---|---|---|---|---|
| LGCN [19] | - | 74.3 | 81.1 | 56.3 | - | - | - |
| HGA [23] | - | 75.4 | 81.0 | 55.1 | - | - | 35.5 |
| HCRN [28] | - | 75.0 | 81.4 | 55.9 | 55.7 | 63.9 | 35.6 |
| B2A [41] | - | 75.9 | 82.6 | 57.5 | - | - | 36.9 |
| HOSTR [10] | - | 75.0 | 83.0 | 58.0 | - | - | 35.9 |
| HAIR [36] | - | 77.8 | 82.3 | 60.2 | - | - | 36.9 |
| MASN [47] | - | 84.4 | 87.4 | 59.5 | - | - | 35.2 |
| PGAT [42] | - | 80.6 | 85.7 | 61.1 | 58.7 | 65.9 | 38.1 |
| HQGA [60] | - | 76.9 | 85.6 | 61.3 | - | - | 38.6 |
| MHN [43] | - | 83.5 | 90.8 | 58.1 | - | - | 38.6 |
| ClipBERT [29] | VG+COCO Caption | 82.8 | 87.8 | 60.3 | - | - | 37.4 |
| SiaSRea [67] | VG+COCO Caption | 79.7 | 85.3 | 60.2 | - | - | 41.6 |
| MERLOT [70] | Youtube180M, CC3M | 94.0 | 96.2 | 69.5 | - | - | 43.1 |
| VGT (Ours) | - | 95.0 | 97.6 | 61.6 | 59.9 | 70.5 | 39.7 |

Table 3: Results on TGIF-QA and MSRVTT-QA. † denotes TGIF-QA-R [42], whose multiple choices for repeated action and state transition are more challenging. We grey out the results reported in [42] regarding these two sub-tasks, because the candidate answers are slightly different as we have further rectified the redundant choices.

4.3 Model Analysis

DGT. The middle block of Table 4 shows that removing the DGT module (w/o DGT), i.e., directly summarizing the object representations in each clip, leads to clear performance drops (2.0%) on all tasks that challenge spatio-temporal reasoning. We then study the temporal graph transformer module (w/o TTrans) by removing both NTrans and ETrans. This variant gives better results than removing the whole DGT module, yet its performance on tasks featuring temporal dynamics is still weak. We further ablate the temporal graph transformer module to investigate the independent contribution of the node transformer (NTrans) and edge transformer (ETrans). The results (w/o NTrans and w/o ETrans) demonstrate that both transformers benefit temporal dynamic modelling. Finally, the ablation study on the global frame feature $F_I$ reveals its vital role in DGT.

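| Models | TGIF-QA Action | TGIF-QA Trans | NExT-QA Val Acc@C | NExT-QA Val Acc@T | NExT-QA Val Acc@D | NExT-QA Val Acc@All |
|---|---|---|---|---|---|---|
| VGT | 95.0 | 97.6 | 52.28 | 55.09 | 64.09 | 55.02 |
| w/o DGT | 89.6 | 95.4 | 50.10 | 52.85 | 64.48 | 53.22 |
| w/o TTrans | 94.0 | 97.6 | 50.86 | 53.04 | 64.86 | 53.74 |
| w/o NTrans | 94.5 | 97.4 | 50.79 | 54.22 | 63.32 | 53.84 |
| w/o ETrans | 94.8 | 97.4 | 51.25 | 54.34 | 64.48 | 54.30 |
| w/o global frame feature | 93.5 | 97.0 | 50.44 | 53.97 | 63.32 | 53.58 |
| CompCLS | 70.1 | 79.9 | 42.96 | 46.96 | 53.02 | 45.82 |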
Table 4: Study of model components

Similarity Comparison vs. Classification. We study a model variant that concatenates the outputs of the DGT module with the token representations from BERT, in a way analogous to ClipBERT [29]. The resulting text-video representation sequence is fed to a cross-modal transformer for information fusion. The output of the ‘[CLS]’ token is then fed to a multi-way classifier over the answer set in open-ended QA, or to a classifier that scores binary relevance in multi-choice QA, following [20, 28, 60]. As can be seen from the bottom part of Table 4, this classification model variant (CompCLS) leads to drastic performance drops. For completeness, we also conduct additional experiments on the FrameQA task, which is set up as open-ended QA. Again, we find that the accuracy drops from 61.6% to 56.9%. A detailed analysis of the performances on the training and validation sets (see Appendix 0.C.1) reveals that the CLS-model suffers from serious over-fitting on the target datasets. The experiment demonstrates the superiority of solving QA by relevance comparison instead of answer classification.
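The contrast between the two answer-selection schemes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a plain dot product as the relevance score, and the toy vectors are all assumptions made for illustration. The comparison route scores each candidate answer embedding against a video-question embedding, while the classification route applies a linear head with one weight vector per answer class to a fused vector.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def answer_by_comparison(query_vec, candidate_vecs):
    """Relevance comparison: score every candidate answer embedding
    against the query embedding and return the best index and all scores."""
    scores = [dot(query_vec, c) for c in candidate_vecs]
    return argmax(scores), scores


def answer_by_classification(fused_vec, class_weights, class_biases):
    """Answer classification: a linear head with one weight vector per
    answer class, applied to a fused cross-modal vector."""
    logits = [dot(w, fused_vec) + b for w, b in zip(class_weights, class_biases)]
    return argmax(logits), logits


# Toy multi-choice example with three candidate answers (made-up numbers).
query = [1.0, 0.0, 2.0]
candidates = [[0.5, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 2.0, 0.5]]
choice, scores = answer_by_comparison(query, candidates)
print(choice, scores)

# Toy open-ended example with a three-class head (made-up numbers).
fused = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.5]
cls_choice, logits = answer_by_classification(fused, weights, biases)
print(cls_choice, logits)
```

In this sketch the comparison route has no answer-specific parameters, so it handles any number of candidates, whereas the classification head carries one weight vector per answer class. Whether that difference accounts for the over-fitting reported above is not something the sketch can show.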

Cross-modal Interaction. Fig. 5 investigates several implementation variants of the cross-modal interaction module described in Sec. 3.4. The results suggest that it is better to integrate textual information at both the frame- and clip-level outputs (CM-CF) for TGIF-QA, while our default interaction at the clip-level outputs (CM-C) gives the best results on NExT-QA. Compared with the baselines that do not use cross-modal interaction, all three kinds of interaction improve performance. Notably, cross-modal interaction improves the accuracy on TGIF-QA by more than 10%. A possible reason is that the GIFs are trimmed short videos that contain only the QA-related visual content. This greatly eases the challenge of spatio-temporal grounding of the positive answers, especially when most of the negative answers are not present in the short GIFs. Thus, cross-modal interaction is more effective on this dataset. The videos in NExT-QA are not trimmed, so the improvements are relatively smaller. Based on these observations, we perform cross-modal interaction at both the frame- and clip-level outputs for the temporal reasoning tasks in TGIF-QA, and keep the default implementation for the other datasets.

Figure 5: Study of Cross-modal Interaction.

4.4 Pretraining and Finetuning

Table 5 presents a comparison between VGT with and without pretraining. Pretraining steadily boosts QA performance, especially on NExT-QA. The relatively smaller improvements on TGIF-QA may be because the TGIF-QA dataset is large and has enough annotated data for fine-tuning, in which case pretraining helps little [73]. Besides, we find that finetuning with masked language modelling (MLM) improves the generalization from the val to the test set, and thus achieves the best overall accuracy (i.e. 55.7%) on the NExT-QA test set. Fig. 6 studies the QA performance on the NExT-QA val set with respect to different amounts of pretraining data. Generally, the overall accuracy (Acc@All) clearly tends to improve when more data is available. A more detailed analysis shows that these improvements mostly come from stronger performance in answering causal (Acc@C) and descriptive (Acc@D) questions. For temporal questions, pretraining with more data does not seem to help much. Therefore, to boost performance, it is promising to add more data or to explore a better way of handling temporal language.

Figure 6: Results of pretraining with different amounts of data.
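| Methods | TGIF-QA Action | TGIF-QA Trans | NExT-QA Val Acc@C | NExT-QA Val Acc@T | NExT-QA Val Acc@D | NExT-QA Val Acc@All | NExT-QA Test Acc@C | NExT-QA Test Acc@T | NExT-QA Test Acc@D | NExT-QA Test Acc@All |
|---|---|---|---|---|---|---|---|---|---|---|
| VGT | 59.9 | 70.5 | 51.29 | 56.02 | 64.99 | 54.94 | 50.82 | 52.29 | 63.27 | 53.51 |
| VGT (FT w/ QA) | 60.2 | 71.0 | 53.93 | 56.20 | 70.14 | 57.19 | 51.73 | 53.78 | 67.05 | 54.88 |
| VGT (FT w/ QA & MLM) | 60.5 | 71.5 | 53.43 | 56.39 | 69.50 | 56.89 | 52.78 | 54.54 | 67.26 | 55.70 |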
Table 5: Study of cross-modal pretraining. Results on NExT-QA are with 20 regions.

4.5 Qualitative Analysis

Figure 7: Result visualization on NExT-QA [59]. The ground-truth answers are in green.

In Fig. 7, we qualitatively analyze the benefits of both the dynamic graph transformer and pretraining. The example in (a) shows that the model without the DGT module is prone to predicting atomic or contact actions (e.g. ‘grab’) that can be captured at the static frame level. (b) shows that the model without pretraining fails to predict an answer that is highly abstract (e.g. ‘adjust’). Finally, we show a failure case in (c). It indicates that our model tends to predict distractor answers that are semantically close to the question when the objects of interest in the video are small and the detector fails to detect them. Keeping more detected regions could help, but one needs to carefully balance graph complexity against inference efficiency. An alternative is to perform modulated detection as in [24]; we leave it for future exploration.

5 Conclusions

We presented the Video Graph Transformer (VGT), which explicitly exploits objects, their relations, and their dynamics to improve visual reasoning and alleviate the data-hungry issue in VideoQA. Our extensive experiments show that VGT achieves superior performance compared with previous SoTA methods on tasks that challenge temporal dynamic reasoning. Its performance even surpasses that of methods pretrained on large-scale vision-text data. To study the learning capacity of VGT, we further explored pretraining on weakly-paired video-text data and obtained promising results. With careful and comprehensive analyses of the model, we hope this work can encourage more effort in designing effective models that alleviate the burden of handling large-scale data, and also promote VQA research that goes beyond holistic recognition/description to reason about fine-grained video details.


This research is supported by the Sea-NExT joint Lab. Major work was done when Junbin was a research intern at Sea AI Lab. We greatly thank Angle Yao as well as the anonymous reviewers for their thoughtful comments, which helped improve this work.


  • [1] E. Amrani, R. Ben-Ari, D. Rotman, and A. Bronstein (2021) Noise estimation using density estimation for self-supervised multimodal learning. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 35, pp. 6644–6652. Cited by: §2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077–6086. Cited by: §3.2, §4.1.
  • [3] M. Bain, A. Nagrani, G. Varol, and A. Zisserman (2021) Frozen in time: a joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1728–1738. Cited by: Appendix 0.B, §2, §4.1.
  • [4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §2.
  • [5] G. Bertasius, H. Wang, and L. Torresani (2021) Is space-time attention all you need for video understanding?. In ICML, pp. 813–824. Cited by: §1.
  • [6] S. Buch, C. Eyzaguirre, A. Gaidon, J. Wu, L. Fei-Fei, and J. C. Niebles (2022) Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2917–2927. Cited by: §2, §4.2.
  • [7] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §2.
  • [8] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: universal image-text representation learning. In European Conference on Computer Vision (ECCV), pp. 104–120. Cited by: §1.
  • [9] A. Cherian, C. Hori, T. K. Marks, and J. Le Roux (2022) (2.5+1)D spatio-temporal scene graphs for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 444–453. Cited by: §4.2, Table 1.
  • [10] L. H. Dang, T. M. Le, V. Le, and T. Tran (2021-08) Hierarchical object-oriented spatio-temporal reasoning for video question answering. In IJCAI, Cited by: Table 3.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §3.1, §3.4, §3.5, §4.2.
  • [12] D. Ding, F. Hill, A. Santoro, M. Reynolds, and M. Botvinick (2021) Attention over learned object embeddings enables complex visual reasoning. Advances in neural information processing systems (NeurIPS) 34. Cited by: §2.
  • [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • [14] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang (2019) Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1999–2007. Cited by: §1, §2.
  • [15] T. Fu, L. Li, Z. Gan, K. Lin, W. Y. Wang, L. Wang, and Z. Liu (2021-11) VIOLET: end-to-end video-language transformers with masked visual-token modeling. In arXiv preprint arXiv:2111.12681, Cited by: §1, §2.
  • [16] J. Gao, R. Ge, K. Chen, and R. Nevatia (2018) Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6576–6585. Cited by: §1, §2.
  • [17] S. Geng, P. Gao, M. Chatterjee, C. Hori, J. Le Roux, Y. Zhang, H. Li, and A. Cherian (2021) Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §3.2.
  • [19] D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan (2020) Location-aware graph convolutional networks for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 11021–11028. Cited by: §2, Table 3.
  • [20] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) Tgif-qa: toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2758–2766. Cited by: Table 6, Appendix 0.A, §1, §1, §1, §2, §2, §4.1, §4.3.
  • [21] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pp. 4904–4916. Cited by: §1.
  • [22] J. Jiang, Z. Chen, H. Lin, X. Zhao, and Y. Gao (2020) Divide and conquer: question-guided spatio-temporal contextual attention for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 11101–11108. Cited by: §2.
  • [23] P. Jiang and Y. Han (2020) Reasoning with heterogeneous graph alignment for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §2, §4.2, Table 1, Table 3.
  • [24] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion (2021) MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1780–1790. Cited by: §4.5.
  • [25] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §3.3.2.
  • [26] R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei (2018) Referring relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6867–6876. Cited by: §3.2.
  • [27] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), pp. 32–73. Cited by: §2.
  • [28] T. M. Le, V. Le, S. Venkatesh, and T. Tran (2020) Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9972–9981. Cited by: §1, §4.3, Table 3.
  • [29] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is more: clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7331–7341. Cited by: §1, §2, §4.3, Table 3.
  • [30] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: localized, compositional video question answering. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  • [31] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. In Advances in neural information processing systems (NeurIPS), Vol. 34. Cited by: §1.
  • [32] L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020) HERO: hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2046–2065. Cited by: §2.
  • [33] X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He, and C. Gan (2019) Beyond rnns: positional self-attention with co-attention for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), pp. 8658–8665. Cited by: §2.
  • [34] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision (ECCV), pp. 121–137. Cited by: §1.
  • [35] Y. Li, X. Wang, J. Xiao, W. Ji, and T. Chua (2022) Invariant grounding for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2928–2937. Cited by: §2, Table 1.
  • [36] F. Liu, J. Liu, W. Wang, and H. Lu (2021-10) HAIR: hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1698–1707. Cited by: §2, Table 3.
  • [37] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu (2022) Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3202–3211. Cited by: §1.
  • [38] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems (NeurIPS), pp. 13–23. Cited by: §1.
  • [39] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9879–9889. Cited by: §2, §4.2.
  • [40] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2630–2640. Cited by: §2.
  • [41] J. Park, J. Lee, and K. Sohn (2021) Bridge to answer: structure-aware graph interaction network for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15526–15535. Cited by: §2, Table 3.
  • [42] L. Peng, S. Yang, Y. Bin, and G. Wang (2021) Progressive graph attention network for video question answering. In ACM MM, pp. 2871–2879. Cited by: Appendix 0.A, §2, §4.2, Table 3.
  • [43] M. Peng, C. Wang, Y. Gao, Y. Shi, and X. Zhou (2022) Multilevel hierarchical network with multiscale sampling for video question answering. IJCAI. Cited by: §4.2, Table 3.
  • [44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763. Cited by: §1, §2, §3.1, §4.2.
  • [45] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems (NeurIPS) 28. Cited by: §1, §3.2.
  • [46] V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. Advances in neural information processing systems (NeurIPS). Cited by: §0.C.3, §4.2.
  • [47] A. Seo, G. Kang, J. Park, and B. Zhang (2021) Attend what you need: motion-appearance synergistic networks for video question answering. In ACL, pp. 6167–6177. Cited by: §1, §2, §4.2, Table 3.
  • [48] P. H. Seo, A. Nagrani, and C. Schmid (2021) Look before you speak: visually contextualized utterances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16877–16887. Cited by: §1, §2.
  • [49] X. Shang, D. Di, J. Xiao, Y. Cao, X. Yang, and T. Chua (2019) Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR), pp. 279–287. Cited by: §2.
  • [50] X. Shang, J. Xiao, D. Di, and T. Chua (2019) Relation understanding in videos: a grand challenge overview. In Proceedings of the 27th ACM International Conference on Multimedia (MM), pp. 2652–2656. Cited by: §2.
  • [51] P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565. Cited by: §2.
  • [52] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-bert: pre-training of generic visual-linguistic representations. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • [53] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7464–7473. Cited by: §1.
  • [54] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. In EMNLP, pp. 5100–5111. Cited by: §1.
  • [55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems (NeurIPS), Vol. 30. Cited by: §1, §3.3.1.
  • [56] L. Wang, X. Chang, S. Li, Y. Chu, H. Li, W. Zhang, X. He, L. Song, J. Zhou, and H. Yang (2021) Tcl: transformer-based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944. Cited by: §2.
  • [57] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In European conference on computer vision (ECCV), pp. 399–417. Cited by: §2.
  • [58] J. Xiao, X. Shang, X. Yang, S. Tang, and T. Chua (2020) Visual relation grounding in videos. In European Conference on Computer Vision (ECCV), pp. 447–464. Cited by: §3.2.
  • [59] J. Xiao, X. Shang, A. Yao, and T. Chua (2021) Next-qa: next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9777–9786. Cited by: Table 6, Appendix 0.A, §0.C.1, §0.C.2, Table 7, §1, §1, §2, §2, Figure 7, §4.1, Table 1.
  • [60] J. Xiao, A. Yao, Z. Liu, Y. Li, W. Ji, and T. Chua (2022) Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2804–2812. Cited by: §1, §2, §3.2, §3.2, §3.4, §3.5, §3.6, §4.1, §4.2, §4.2, §4.3, Table 1, Table 3.
  • [61] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In European Conference on Computer Vision (ECCV), pp. 305–321. Cited by: §1, §4.2.
  • [62] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang (2017) Video question answering via gradually refined attention over appearance and motion. In ACM MM, pp. 1645–1653. Cited by: Table 6, §1, §2, §4.1.
  • [63] H. Xu, G. Ghosh, P. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer (2021) VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In EMNLP, pp. 6787–6800. Cited by: §1, §1, §2.
  • [64] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid (2021) Just ask: learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1686–1697. Cited by: Appendix 0.B, §0.C.3, §1, §2, §4.2, Table 1, Table 2.
  • [65] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum (2019) CLEVRER: collision events for video representation and reasoning. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [66] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Liu (2021) Do transformers really perform badly for graph representation?. Advances in neural information processing systems (NeurIPS) 34. Cited by: §2.
  • [67] W. Yu, H. Zheng, M. Li, L. Ji, L. Wu, N. Xiao, and N. Duan (2021) Learning from inside: self-driven siamese sampling and reasoning for video question answering. Advances in neural information processing systems (NeurIPS) 34. Cited by: §1, Table 3.
  • [68] Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In European Conference on Computer Vision (ECCV), pp. 471–487. Cited by: §1, §2.
  • [69] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim (2019) Graph transformer networks. Advances in neural information processing systems (NeurIPS) 32. Cited by: §2.
  • [70] R. Zellers, X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, and Y. Choi (2021) Merlot: multimodal neural script knowledge models. In Advances in neural information processing systems (NeurIPS), Vol. 34. Cited by: §1, §2, §4.2, Table 3.
  • [71] Y. Zhong, W. Ji, J. Xiao, Y. Li, W. Deng, and T. Chua (2022) Video question answering: datasets, algorithms and challenges. arXiv preprint arXiv:2203.01225. Cited by: §1.
  • [72] L. Zhu and Y. Yang (2020) Actbert: learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8746–8755. Cited by: §1, §2.
  • [73] B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. Le (2020) Rethinking pre-training and self-training. In Advances in neural information processing systems (NeurIPS), Vol. 33, pp. 3833–3845. Cited by: §4.4.

Appendix 0.A Data Statistics

The statistics of the datasets used in our experiments are presented in Table 6. For better comparison with previous works, we focus on the multi-choice QA task in NExT-QA [59], although it also defines open-ended QA. For TGIF-QA [20], we additionally conduct experiments on the latest version [42], which generates more challenging negative answers for each question in the multi-choice tasks. In particular, we further fix the ‘redundant answer’ issue: we find that about 10% of the questions have redundant candidate answers, and some of the candidate answers are even identical to the correct one. The rectified annotations will be released along with the code.

| Datasets | Main Challenges | #Videos/#QAs | Train | Val | Test | VLen (s) | QA |
|---|---|---|---|---|---|---|---|
| NExT-QA [59] | Causal & Temporal Interaction | 5.4K/48K | 3.8K/34K | 0.6K/5K | 1K/9K | 44 | MC |
| TGIF-QA [20] | Repetition Action | 22.8K/22.7K | 20.5K/20.5K | - | 2.3K/2.3K | 3 | MC |
| TGIF-QA [20] | State Transition | 29.5K/58.9K | 26.4K/52.7K | - | 3.1K/6.2K | 3 | MC |
| TGIF-QA [20] | Frame QA | 39.5K/53.1K | 32.3K/39.4K | - | 7.1K/13.7K | 3 | OE |
| MSRVTT-QA [62] | Descriptive QA | 10K/244K | 6.5K/159K | 0.5K/12K | 3K/73K | 15 | OE |
Table 6: Data statistics. OE: Open-Ended QA. MC: Multi-Choice QA, VLen (s): Average video length in seconds.

Appendix 0.B Implementation Details

For training with QA annotations, we first train the whole model (except for the object detection model) end-to-end, and then freeze BERT to fine-tune the other parts of the best model obtained at the 1st stage. The better result of the two stages is reported as the final result. Note that our hyper-parameters are mostly searched on the NExT-QA validation set and kept unchanged for the other datasets. The maximum number of epochs varies from 10 to 30 across datasets. For pretraining with data crawled from the Web, we download about 80K video-text pairs (less than 5%) from WebVid2.5M [3]. Frames are then extracted from the videos at 5 frames per second and processed in the same way as for QA. We then optimize the model with an initial learning rate of

and batch size 64. The number of negative descriptions per video for cross-modal matching is set to 63; they are randomly selected from the descriptions of other videos in the whole training set. In masked language modelling, each text token is corrupted with a probability of 15%. Following [64], a corrupted token is replaced with 1) the ‘[MASK]’ token with a probability of 80%, 2) a random token with a probability of 10%, and 3) the same token with a probability of 10%. We train the model for at most 2 epochs, which gives the best generalization results and takes about 2 hours.
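The token corruption rule described above (15% selection, then an 80/10/10 split) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `corrupt_tokens`, its arguments, and the toy vocabulary are hypothetical names, and tokens are plain strings rather than tokenizer ids.

```python
import random

MASK = "[MASK]"


def corrupt_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, targets).

    Each token is selected for corruption with probability `select_prob`.
    A selected token becomes '[MASK]' 80% of the time, a random vocabulary
    token 10% of the time, and stays unchanged 10% of the time.
    `targets[i]` holds the original token at selected positions, else None.
    """
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected; no prediction target
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise keep the original token in place
    return corrupted, targets


# Toy usage with a made-up vocabulary and caption.
vocab = ["a", "man", "is", "riding", "horse", "the", "dog", "runs"]
caption = ["a", "man", "is", "riding", "a", "horse"]
corrupted, targets = corrupt_tokens(caption, vocab, random.Random(0))
print(corrupted, targets)
```

Only the positions where `targets` is not `None` would contribute to a masked-language-modelling loss; the remaining positions pass through unchanged.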

Appendix 0.C Additional Model Analysis

0.C.1 Similarity Comparison vs. Classification

To study the reason for the poor performance of the classification model variant described in Sec. 4.3 of the main text, we visualize the training and validation accuracy over training epochs in Fig. 8. The results indicate that the classification model variant suffers from serious over-fitting, especially on NExT-QA [59], whose QA contents are relatively complex but which has less training data. To study whether the problem comes from the classification formulation or from the cross-modal transformer, we further substitute the cross-modal transformer (CM-Trans) with our cross-modal interaction (CM) module introduced in Sec. 3.4 of the main text. We find that this substitution slightly alleviates the problem. For example, on the NExT-QA val set, the accuracy increases from 45.82% to 46.98%. Nevertheless, the performance is still much worse than that of the comparison-based implementation (i.e. 55.02%). This experiment reveals two facts: 1) formulating the QA problem as classification is the major cause of the weak performance; 2) the cross-modal transformer exacerbates the over-fitting problem, possibly because it involves additional parameters.

Figure 8: Accuracy with regard to different training epochs.

0.C.2 Study of Video Sampling

In Fig. 9, we study the effect of the number of sampled video clips and region proposals on the NExT-QA [59] test set. Regarding the number of sampled video clips, we find that the setting of 8 clips consistently outperforms that of 4 clips. This is understandable, as the videos in NExT-QA are relatively long. As for the sampled regions, when learning the model from scratch, the setting of 5 regions gives a relatively better result, e.g., 53.68%. Nonetheless, when pretraining is considered, the setting of 20 regions gives a better result, e.g., 55.70%. This difference may be because learning with more regions can cause over-fitting when the dataset is not large enough, since the constructed graph becomes much larger and more complex. Our speculation is also supported by the fact that the accuracy increases with the number of sampled regions when we sample only 4 video clips and thus have fewer graph nodes in total.

Figure 9: Investigation of sampled video clips and region proposals per frame. Results are reported on NExT-QA test set.

0.C.3 Model Efficiency

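| Models | Acc@All | #Params (M) | GPU Memory (Train) | GPU Memory (Infer) | Time (Train) | Infer (FLOPs) |
|---|---|---|---|---|---|---|
| VQA-T [64] | 45.30 | 156.5 | 5.6G | 2.6G | 2m8 | 2448M |
| VGT (BERT) | 55.02 | 133.7 | 16.2G | 3.9G | 7m5 | 7121M |
| VGT (DistilBERT) | 53.46 | 90.5 | 10.0G | 3.5G | 5m7 | 3922M |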
Table 7: Comparison of memory and time based on NExT-QA [59]. (2m8: 2 minutes per epoch and 8 epochs in total.)

We compare VGT with VQA-T [64] in Tab. 7 to better understand the memory and time costs. Experiments are run on one Tesla V100 GPU with batch size 64, and inference FLOPs are reported for one example. Memory: VGT has fewer trainable parameters (133.7M vs. 156.5M) and thus a smaller model size than VQA-T (511M vs. 600M). The BERT encoder in VGT takes 82% of the parameters; the vision part is lightweight, with only 24M parameters. VGT needs more GPU memory for training, yet its inference memory is fairly small and close to that of VQA-T. We also implement a smaller version of VGT by replacing BERT with DistilBERT [46], as in VQA-T. With nearly 0.6× the parameters of VQA-T (90.5M vs. 156.5M), it still achieves strong performance (i.e. 53.46%). Time: VGT's FLOPs on one example are 2.9× those of VQA-T, and 1.6× if we use DistilBERT. However, VGT converges much faster and needs far fewer epochs (total FLOPs) to obtain results superior to VQA-T when training with the same data. For example, on NExT-QA, VGT's result at epoch 2 (50.16%) already significantly surpasses VQA-T's best result (45.30%), achieved at epoch 8. Also, VGT's result without pretraining surpasses that of VQA-T pretrained with million-scale data. In this sense, VGT needs far fewer total FLOPs than VQA-T and other similar pretrained models for visual reasoning.
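The ratios quoted in this comparison can be re-derived from the Table 7 entries with a few lines of arithmetic. The sketch below only restates numbers from the table; the dictionary layout and helper names are assumptions for illustration, and using minutes per epoch times epochs as a cost measure is a rough wall-clock proxy rather than a FLOPs measurement.

```python
# Values copied from Table 7; "2m8" is read as 2 minutes/epoch for 8 epochs.
TABLE7 = {
    "VQA-T": {"params_m": 156.5, "infer_mflops": 2448, "min_per_epoch": 2, "epochs": 8},
    "VGT (BERT)": {"params_m": 133.7, "infer_mflops": 7121, "min_per_epoch": 7, "epochs": 5},
    "VGT (DistilBERT)": {"params_m": 90.5, "infer_mflops": 3922, "min_per_epoch": 5, "epochs": 7},
}


def ratio(model, key, baseline="VQA-T"):
    """Ratio of one Table 7 quantity for `model` relative to the baseline."""
    return TABLE7[model][key] / TABLE7[baseline][key]


def train_minutes(model, epochs=None):
    """Wall-clock training minutes: minutes per epoch times number of epochs."""
    row = TABLE7[model]
    n = row["epochs"] if epochs is None else epochs
    return row["min_per_epoch"] * n


flops_bert = ratio("VGT (BERT)", "infer_mflops")          # about 2.9
flops_distil = ratio("VGT (DistilBERT)", "infer_mflops")  # about 1.6
params_distil = ratio("VGT (DistilBERT)", "params_m")     # about 0.58
bert_share = 1 - 24 / TABLE7["VGT (BERT)"]["params_m"]    # about 0.82

print(f"FLOPs ratio: {flops_bert:.1f}x (BERT), {flops_distil:.1f}x (DistilBERT)")
print(f"DistilBERT parameter ratio: {params_distil:.2f}")
print(f"BERT share of VGT parameters: {bert_share:.0%}")
print(f"Minutes to VGT epoch 2: {train_minutes('VGT (BERT)', epochs=2)}, "
      f"minutes to VQA-T epoch 8: {train_minutes('VQA-T')}")
```

Under this reading, reaching VGT's epoch-2 result takes about 14 minutes against 16 minutes for VQA-T's full 8 epochs, while VGT's full 5-epoch schedule takes about 35 minutes.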