Since the 1960s, the very beginning of Artificial Intelligence (AI), steady progress has been made towards machine systems that can demonstrate their understanding of the dynamic visual world by responding to natural language queries in the context of videos, which directly reflect our physical surroundings. In particular, since 2019, we have witnessed drastic advances in such multi-disciplinary AI, where computer vision, natural language processing, and knowledge reasoning are coordinated for accurate decision making. This advancement stems in part from the success of multi-modal pretraining on web-scale vision-text data [8, 21, 31, 34, 38, 44, 52, 53, 54, 63], and in part from a unified deep neural architecture that models both vision and natural language data well, i.e., the Transformer. As a typical multi-disciplinary AI task, Video Question Answering (VideoQA) has benefited greatly from these developments, which have propelled the field steadily forward beyond purely conventional techniques [14, 16, 20, 23, 28, 60, 71].
Despite the excitement, we find that the advances made by such transformer-style models mostly lie in answering questions that demand holistic recognition or description of video contents [30, 48, 62, 63, 64, 68, 72]. The problem of answering questions that challenge real-world visual relation reasoning, especially the causal and temporal relations that characterize video dynamics [20, 59], is largely under-explored. Cross-modal pretraining seems promising [29, 67, 70]. Yet, it requires handling prohibitively large-scale video-text data [15, 70], or otherwise the performance remains inferior to state-of-the-art (SoTA) conventional techniques [29, 47, 67]. In this work, we empirically reveal two major reasons that account for the failure: 1) Video encoders are overly simplistic. Current video encoders are either 2D neural networks (CNNs [18, 45] or Transformers) operated over sparse frames, or 3D neural networks [5, 37, 61] operated over short video segments. Such networks encode the videos holistically but fail to explicitly model the fine-grained details, i.e., spatio-temporal interactions between visual objects. Consequently, the resulting VideoQA models are weak in reasoning and require large-scale video data during learning to compensate for such weak forms of input. 2) The formulation of the VideoQA problem is sub-optimal. Often, in multi-choice QA, the video, question, and each candidate answer are appended (or fused) into one holistic token sequence and fed to a cross-modal Transformer to obtain a global representation for answer classification [29, 72]. Such a global representation is weak at disambiguating the candidate answers, because the video and question portions are identical across candidates and large, and may thus overwhelm the short answer and dominate the overall representation. In open-ended QA (popularly formulated as multi-class classification), answers are treated as class indexes and their word semantics (which are helpful for QA) are ignored.
The insufficient information modelling exacerbates the data-hungry issue and leads to sub-optimal performance as well.
To improve visual relation reasoning and also reduce the data demands for video question answering, we propose the Video Graph Transformer (VGT) model. VGT addresses the aforementioned problems and advances over previous transformer-style VideoQA models mainly in two aspects: 1) For the video encoder, it designs a dynamic graph transformer module which explicitly captures the objects and relations as well as their dynamics to improve visual reasoning in dynamic scenarios. 2) For the problem formulation, it exploits separate vision and text transformers to encode video and text respectively for similarity (or relevance) comparison, instead of using a single cross-modal transformer to fuse the vision and text information for answer classification. Vision-text communication is done by additional cross-modal interaction modules. Through more sufficient video information modelling and a more reasonable QA formulation, we show that VGT achieves much better performance on benchmarks featuring dynamic relation reasoning than previous arts, including those pretrained on million-scale vision-text data. Such strong performance comes even without using external data for pretraining. When pretraining VGT with a small amount of data, we observe further non-trivial performance improvements. The results clearly demonstrate VGT's effectiveness and superiority in visual reasoning, as well as its potential for more data-efficient video-language pretraining (i.e., the model demands less training data to achieve good performance).
To summarize our contributions: 1) We propose the Video Graph Transformer (VGT), which advances VideoQA from shallow description to in-depth reasoning. 2) We design a dynamic graph transformer module which shows strength for visual reasoning. The module is task-agnostic and can be easily applied to other video-language tasks. 3) We achieve SoTA results on NExT-QA and TGIF-QA, which challenge visual reasoning over dynamic visual contents. Moreover, our structured video representation shows promise for data-efficient video-language pretraining.
2 Related Work
Conventional Techniques for VideoQA. Prior to the success of Transformers for vision-language tasks, various techniques, e.g., cross-modal attention [20, 33, 22], motion-appearance memory [16, 14, 36], and graph neural networks [23, 35, 41], were proposed to model informative video contents for answering questions. Yet, most of them leverage frame- or clip-level video representations as the information source. Recently, graphs constructed over object-level representations [19, 36, 47, 60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20, 49, 50, 59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations in 1) space and time, or 2) local and global scopes [19, 57], or build static graphs at the frame level without explicitly capturing the temporal dynamics [36, 42, 60]. The monolithic graph is cumbersome for long videos where multiple objects interact in space-time. Moreover, the static graphs may yield incorrect relations (e.g., hug vs. fight) or fail to capture dynamic relations (e.g., take away). In this work, we model a video as a local-to-global dynamic visual graph, and design a graph transformer module to explicitly model the objects, their relations, and their dynamics, exploiting objects and relations in adjacent frames to calibrate the spurious relations obtained at the static frame level. Importantly, we also integrate strong language models and explore cross-modal pretraining techniques to learn the structured video representations in a self-supervised manner.
Transformer for VideoQA. Pioneering works [32, 48, 63, 64, 72] learn generalizable representations from HowTo100M by either applying various proxy tasks or curating more tailor-made supervision (e.g., future utterances and QA pairs) for VideoQA. However, they focus on answering questions that demand holistic recognition or shallow description, and their performance on visual relation reasoning [20, 59] remains unknown. Furthermore, recent works [3, 70] reveal that these models may suffer from performance loss on open-domain questions due to the heavy noise [1, 39] and limited data scope of HowTo100M. Recent efforts tend to use open-domain vision-text data for end-to-end learning. ClipBERT takes advantage of image-caption data [7, 27] for pretraining, but it achieves only limited improvement on temporal reasoning tasks, as temporal relations are hard to learn from static images. In addition, ClipBERT relies on human-annotated descriptions, which are expensive to annotate and hard to scale up. More recent works [15, 70] collect million-scale user-generated (vastly abundant on the Web) vision-text data [3, 51, 70] for pretraining, but suffer from the huge computational cost of training on such large-scale datasets. Two latest works [6, 12] reveal the potential of Transformers for learning on the (relatively small-scale) target datasets. While promising, they either aim to reveal the single-frame bias of benchmark datasets using image-text pretrained features (e.g., from CLIP), or only demonstrate the model's effectiveness on synthesized data. Overall, the poor-dynamic-reasoning and data-hungry problems in existing transformer-style video-language models largely motivate this work. To alleviate these problems, we explicitly model the objects and relations for dynamic visual reasoning and incorporate structure priors (or relational inductive bias) into transformer architectures to reduce the demand on data.
Graph Transformer. The connection between graph neural networks and Transformers has attracted increasing attention [56, 66, 69]. Nonetheless, the major advancements are made in modelling natural graph data (e.g., social connections), either by incorporating graph expertise (e.g., node degrees) into the self-attention block of the Transformer, or by designing transformer-style convolution blocks to fuse information from heterogeneous graphs. A recent work combines graphs and Transformers for video dialogue. Yet, it simply applies a global transformer over pooled graph representations built from static frames and does not explicitly encode object and relation dynamics. Our work differs by designing and learning a dynamic visual graph over video objects and using transformers to capture the temporal dynamics at both local and global scopes.
3 Method

Given a video v and a question q, VideoQA aims to combine the two streams of information v and q to predict the answer a. Depending on the task setting, a can be given in multiple choices along with each question for multi-choice QA, or it comes from a global answer set for open-ended QA. In this work, we handle both types of VideoQA by optimizing the following objective:
$a^* = \arg\max_{a \in \mathcal{A}} \mathcal{F}_{W}(a \mid q, v), \quad (1)$

in which $\mathcal{A}$ corresponds to the candidate answers of each question in multi-choice QA, or to the global answer set in open-ended QA. $\mathcal{F}_{W}$ denotes the mapping function with learnable parameters $W$.
3.1 Overview

To solve the problem, we design a Video Graph Transformer (VGT) model to perform the mapping in Eqn. (1). As illustrated in Fig. 1, at the visual part (orange), VGT takes as input visual object graphs and derives a global feature $f_v$, with the integration of textual information, to represent the query-relevant video content. At the textual part (blue), VGT extracts the feature representations $f_a$ for all the candidate answers via a language model (e.g., BERT). The final answer is determined by returning the candidate answer with maximal similarity (relevance score) between $f_v$ and $f_a$ via dot-product. At the heart of the model is the dynamic graph transformer (DGT) module. The module reasons over the input graphs clip-wisely, and aggregates them into a sequence of feature representations, which are then fed to a global transformer to obtain $f_v$. During training, the whole framework is optimized end-to-end with a Softmax cross-entropy loss. For pretraining with weakly-paired video-text data, we adopt cross-modal matching as the major proxy task and optimize the model in a contrastive manner along with masked language modelling.
3.2 Video Graph Representation
Given a video, we sparsely sample $l_v$ frames, evenly distributed into $k$ clips of length $l_c = l_v / k$. For each sampled frame (see Fig. 2), we extract RoI-aligned features as object appearance representations, along with their spatial locations, with a pretrained object detector [2, 45], where $o_i$ represents the $i$-th object region in a frame. Additionally, we obtain an image-level feature $f_I$ for each sampled frame with a pretrained image classification model. These features serve as global contexts to augment the graph representations aggregated from the local objects.
To find the same object across different frames within a clip, we define a linking score by considering appearance and spatial location:

$s_{i,j} = \psi(o_i^t, o_j^{t+1}) + \lambda\,\mathrm{IoU}(b_i^t, b_j^{t+1}), \quad (2)$

where $\psi(\cdot,\cdot)$ denotes the cosine similarity between two detected objects $o_i^t$ and $o_j^{t+1}$ in adjacent frames, and Intersection-over-Union (IoU) computes the location overlap of the objects' bounding boxes $b_i^t$ and $b_j^{t+1}$. Our experiments always set $\lambda$ to one. The detected objects in the first frame of each clip are designated as anchor objects. Detected objects in consecutive frames are then linked to the anchor objects by greedily maximizing $s$ frame by frame (we assume that the group of objects does not change within a short video clip). By aligning objects within a clip, we ensure the consistency of the node and edge representations for the graphs constructed at different frames.
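The greedy linking step can be sketched in NumPy as follows. This is a minimal illustration under our own naming (`iou`, `link_objects`, `lam`), not the authors' code; it instantiates the cosine-plus-IoU score above with λ = 1:

```python
import numpy as np

def iou(box_a, box_b):
    # boxes as [x1, y1, x2, y2]
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_objects(feats_t, boxes_t, feats_t1, boxes_t1, lam=1.0):
    """Greedily match each anchor object in frame t to an object in
    frame t+1 by cosine similarity of appearance plus lam * IoU."""
    assignment = []
    used = set()
    for i in range(len(feats_t)):
        best, best_s = -1, -np.inf
        for j in range(len(feats_t1)):
            if j in used:
                continue
            cos = feats_t[i] @ feats_t1[j] / (
                np.linalg.norm(feats_t[i]) * np.linalg.norm(feats_t1[j]) + 1e-8)
            s = cos + lam * iou(boxes_t[i], boxes_t1[j])
            if s > best_s:
                best, best_s = j, s
        assignment.append(best)
        used.add(best)
    return assignment
```

Applying the function frame by frame within a clip yields the object tracks used to align graph nodes.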
Next, we concatenate the object appearance and location representations and project the combined feature into the $d$-dimensional space via

$o_i = \phi_{W_o}([f_{o_i}; f_{b_i}]), \quad (3)$

where $[\,;\,]$ denotes feature concatenation and $f_{b_i}$ is obtained by applying a convolution over the object's relative coordinates. The function $\phi_{W_o}$ denotes a linear transformation with parameters $W_o$. With the node representations $O_t = \{o_i\}_{i=1}^{n}$, the relations in the $t$-th frame can be initialized as pairwise similarities:

$R_t = \sigma\big(\phi_{W_q}(O_t)\,\phi_{W_k}(O_t)^{\top}\big), \quad (4)$

where $\phi_{W_q}$ and $\phi_{W_k}$ denote linear transformations with parameters $W_q$ and $W_k$ respectively. We use different transformations to reflect the asymmetric nature of real-world subject-object interactions [26, 58]; for symmetric relations, we expect the learned parameters $W_q$ and $W_k$ to be quite similar. $\sigma$ is the Softmax operation that normalizes each row. For brevity, we use $G_t = (O_t, R_t)$ to denote the graph representation of the $t$-th frame, where $O_t$ are node representations and $R_t$ are edge representations of the graph.
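The relation initialization in Eqn. (4) amounts to a row-normalized similarity between two linear projections of the node features. A minimal NumPy sketch (function and variable names are ours, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_relations(O, Wq, Wk):
    """Pairwise relation matrix R = softmax(phi_q(O) @ phi_k(O)^T).
    Distinct Wq / Wk keep R asymmetric, mirroring subject-object roles."""
    return softmax((O @ Wq) @ (O @ Wk).T, axis=-1)
```

With distinct random `Wq` and `Wk`, the resulting matrix is generally asymmetric, as intended for directed interactions.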
3.3 Dynamic Graph Transformer
Our dynamic graph transformer (DGT) takes as input a set of visual graphs clip-wisely, and outputs a sequence of representations by mining the temporal dynamics of objects and their spatial interactions. To this end, we sequentially operate a temporal graph transformer unit, a spatial graph convolution unit and a hierarchical aggregation unit as detailed below.
3.3.1 Temporal Graph Transformer
As illustrated in Fig. 3, the temporal graph transformer unit takes as input a set of graphs and outputs a new set of graphs by mining the temporal dynamics among them via a node transformer (NTrans) and an edge transformer (ETrans). For completeness, we briefly recap the self-attention in Transformer. It uses multi-head self-attention (MHSA) to fuse a sequence of input features $X$:

$\mathrm{MHSA}(X) = \phi_{W_m}\big([\mathrm{SA}_1(X); \ldots; \mathrm{SA}_e(X)]\big), \quad (5)$

where $\phi_{W_m}$ is a linear transformation with parameters $W_m$, and

$\mathrm{SA}_i(X) = \mathrm{SA}\big(\phi_{W_q^i}(X), \phi_{W_k^i}(X), \phi_{W_v^i}(X)\big), \quad (6)$

where $\phi_{W_q^i}$, $\phi_{W_k^i}$, and $\phi_{W_v^i}$ denote the linear transformations of the query, key, and value vectors of the $i$-th self-attention (SA) head respectively. $e$ denotes the number of self-attention heads, and SA is defined as:

$\mathrm{SA}(Q, K, V) = \sigma\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \quad (7)$

in which $d_k$ is the dimension of the key vector. Finally, a skip-connection with layer normalization (LN) is applied to the output sequence: $\tilde{X} = \mathrm{LN}(\mathrm{MHSA}(X) + X)$. $\tilde{X}$ can undergo more MHSA blocks depending on the number of transformer layers.
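The recap above is standard scaled dot-product attention; a compact single-layer NumPy sketch (names such as `mhsa_layer` and the toy weights are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sa(X, Wq, Wk, Wv):
    # one self-attention head: softmax(Q K^T / sqrt(d_k)) V
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def mhsa_layer(X, heads, Wm):
    # concatenate the heads, project, then apply the skip-connection + LN
    H = np.concatenate([sa(X, *h) for h in heads], axis=-1)
    return layer_norm(H @ Wm + X)
```

The same block is reused for both the node transformer and the edge transformer, with the inputs being object sequences and expanded relation rows respectively.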
In the temporal graph transformer, we apply self-attention blocks to enhance the node (or object) representations by aggregating information from the nodes of the same object across all adjacent frames within a clip:

$\hat{O}_i = \mathrm{NTrans}\big(o_i^{1}, \ldots, o_i^{l_c}\big), \quad (8)$

in which $(o_i^{1}, \ldots, o_i^{l_c})$ denotes the sequence of feature representations corresponding to object $i$ in a video clip of length $l_c$. Our motivation behind the node transformer is that it models the change of a single object's behaviour and can thus infer dynamic actions (e.g., bend down). It also helps improve an object's appearance feature in cases where the object suffers from motion blur or partial occlusion at certain frames.
Based on the new nodes $\hat{O}$, we update the relation matrices via Eqn. (4). Then, to explicitly model the temporal relation dynamics, we apply an edge transformer on the updated relation matrices:

$\tilde{R} = \mathrm{ETrans}\big(R_1, \ldots, R_{l_c}\big), \quad (9)$

where $R_t$ ($t \in \{1, \ldots, l_c\}$) are the adjacency matrices, row-wisely expanded into sequences. Our motivation is that the relations captured at static frames may be spurious, trivial, or incomplete. The edge transformer can help calibrate the wrong relations and recall the missing ones. For brevity, we refer to the temporally contextualized graph at the $t$-th frame as $\tilde{G}_t$.
3.3.2 Spatial Graph Convolution
The temporal graph transformer focuses on temporal relation reasoning. To reason over the objects' spatial interactions, we apply a $U$-layer graph attention convolution on all the graphs:

$X^{(u)} = \mathrm{ReLU}\big((\tilde{R} + I)\,X^{(u-1)} W^{(u)}\big), \quad (10)$

where $W^{(u)}$ are the graph parameters at the $u$-th layer and $I$ is the identity matrix for skip connections. $X^{(0)}$ is initialized by the output node representations as aforementioned; the frame index is omitted for brevity. A last skip-connection, $X_{out} = X^{(U)} + X^{(0)}$, is used to obtain the final node representations.
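Under these definitions, one convolution layer is a matrix product with the identity added to the (calibrated) adjacency. A small NumPy sketch, assuming a ReLU nonlinearity and our own function names:

```python
import numpy as np

def graph_conv(X0, R, Ws):
    """U-layer graph convolution over adjacency R. The identity matrix
    added to R gives per-layer self-loops; a final skip adds X0 back."""
    I = np.eye(R.shape[0])
    X = X0
    for W in Ws:                      # one weight matrix per layer
        X = np.maximum(0.0, (R + I) @ X @ W)
    return X + X0                     # last skip-connection
```

Note that with zero weights the output degenerates to the input, which makes the final skip-connection easy to verify.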
3.3.3 Hierarchical Aggregation
The node representations so far have explicitly taken into account the objects' spatial and temporal interactions. But such interactions are mostly atomic. To aggregate these atomic interactions into higher-level video elements, we adopt the hierarchical aggregation strategy shown in Fig. 4.
First, we aggregate the graph nodes at each frame by a simple attention:

$f_G = \sum_{i=1}^{n} \alpha_i\, x_i, \qquad \alpha = \sigma\big(\phi_{W_g}(X_{out})\big), \quad (11)$

where $\phi_{W_g}$ is a linear transformation with parameters $W_g$. The graph representation $f_G$ captures the local object interactions. It may lose sight of the global picture of a frame, especially since we only retain $n$ objects and cannot guarantee that they include all the objects of interest in that frame. As such, we complement $f_G$ with the frame-level feature $f_I$ by concatenation:

$f_t = \big[\phi_{W_G}(f_G);\, \phi_{W_I}(f_I)\big], \quad (12)$

in which $\phi_{W_G}$ and $\phi_{W_I}$ are linear transformations with parameters $W_G$ and $W_I$ respectively. We next pool the local interactions within each clip to obtain a sequence of clip-level feature representations via:

$f_c = \mathrm{Pool}\big(f_1, \ldots, f_{l_c}\big). \quad (13)$

The set of $k$ clips is finally represented by $F_{DGT} = (f_{c_1}, \ldots, f_{c_k})$.
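The three aggregation steps (attention pooling over nodes, concatenation with the frame feature, and within-clip pooling) can be sketched as follows. Mean-pooling over frames and all names here are our assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_feature(X, w, f_img, Wg, Wi):
    """Attention-pool the n graph nodes, then concatenate the projected
    pooled feature with the projected image-level feature."""
    alpha = softmax(X @ w)          # (n,) attention weights over nodes
    f_g = alpha @ X                 # pooled graph representation
    return np.concatenate([f_g @ Wg, f_img @ Wi])

def clip_features(frame_feats, clip_len):
    """Mean-pool frame features within each clip."""
    k = len(frame_feats) // clip_len
    return frame_feats.reshape(k, clip_len, -1).mean(axis=1)
```

The output of `clip_features` plays the role of the DGT output sequence fed to the global transformer.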
3.4 Cross-modal Interaction
To find the informative visual contents with respect to a particular text query, cross-modal interaction between the visual and textual nodes is essential. Given a set of visual nodes denoted by $X^v$, we integrate textual information into the visual nodes via a simple cross-modal attention:

$\hat{x}_i^v = x_i^v + \sum_{m=1}^{M} \beta_{i,m}\, q_m, \qquad \beta_i = \sigma\big({x_i^v}^{\top} Q\big), \quad (14)$

where $M$ is the number of tokens in the text query. In principle, the visual nodes $X^v$ can be representations from different levels of the DGT module. In our experiments, we explore performing the cross-modal interaction with visual representations at the object level ($o_i$ in Eqn. (3)), frame level ($f_t$ in Eqn. (12)), and clip level ($f_c$ in Eqn. (13)). We find that the results vary among different datasets. By default, we perform cross-modal interaction at the output of the DGT module (i.e., $F_{DGT}$), since the number of nodes at this stage is much smaller, and the node representations have already absorbed the information from the preceding layers. For the text nodes $Q = \{q_m\}_{m=1}^{M}$, we obtain them by a simple linear projection on the token outputs of a language model. The text query Q can be a question in open-ended QA or a QA pair in multi-choice QA. Note that in multi-choice QA, we max-pool the obtained query-aware visual representations with respect to different QA pairs to find the one that is most relevant to the video.
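A minimal NumPy sketch of this residual cross-modal attention, assuming the visual nodes and text tokens share the same dimension (the function name is ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attend(Xv, Q):
    """Integrate M text-token features Q into N visual nodes Xv:
    each node attends over the tokens, and the attended summary is
    added back residually."""
    beta = softmax(Xv @ Q.T, axis=-1)   # (N, M): one distribution per node
    return Xv + beta @ Q
```

The residual form keeps the visual nodes intact when the text carries little signal, which matches the light-weight design goal stated above.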
3.5 Global Transformer
The aforementioned DGT module focuses on extracting informative visual clues from video clips. To capture the temporal dynamics between these clips, we employ another transformer over the cross-modal interacted clip features (i.e., $F_{DGT}^{q}$), with learnable sinusoidal temporal position embeddings $P$ added. Finally, the transformer's outputs are mean-pooled to obtain the global representation $f_v$ for the entire video:

$f_v = \mathrm{Mean}\big(\mathrm{Transformer}(F_{DGT}^{q} + P)\big). \quad (15)$
The global transformer has two major advantages: 1) It retains the overall hierarchical structure that progressively derives the video elements at different granularities. 2) It improves the feature compatibility of vision and text, which may benefit cross-modal comparison.
3.6 Answer Prediction
To obtain a global representation for a particular answer candidate, we mean-pool its token representations from BERT, i.e., $f_a = \mathrm{Mean}(F_A)$, where $F_A$ denotes a candidate answer's token representations, obtained in a way analogous to Eqn. (15). Its similarity with the query-aware video representation $f_v$ is then obtained via a dot-product. Consequently, the candidate answer of maximal similarity is returned as the final prediction:

$a^* = \arg\max_{a \in \mathcal{A}}\big(s_{vq,a}\big), \qquad s_{vq,a} = f_v^{\top} f_a, \quad (16)$

in which $\mathcal{A}$ denotes the set of candidate answers. Additionally, for open-ended QA, we follow previous works and enable video-absent QA by directly computing the similarities $s_{q,a}$ between the question representation $f_q$ (obtained in a way similar to $f_a$) and the answer representations. As a result, the final answer can be a joint decision:

$a^* = \arg\max_{a \in \mathcal{A}}\big(s_{vq,a} \odot s_{q,a}\big), \quad (17)$
in which $\odot$ is the element-wise product over the candidate score vectors. During training, we maximize the (VQ, A) similarity corresponding to the correct answer of a given sample by optimizing the Softmax cross-entropy loss:

$\mathcal{L}_{QA} = -\sum_{i} y_i \log s_i, \quad (18)$

where $s_i$ is the matching score for the $i$-th sample, and $y_i = 1$ if the answer index corresponds to the $i$-th sample's ground-truth answer and 0 otherwise.
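The prediction and training objective reduce to a dot-product followed by Softmax cross-entropy; a NumPy sketch under our own naming:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict(f_v, F_a):
    """Dot-product relevance between the video representation f_v and
    each candidate answer representation (rows of F_a); argmax wins."""
    s = F_a @ f_v
    return int(np.argmax(s)), s

def qa_loss(s, gt):
    # Softmax cross-entropy over the candidate scores
    return -np.log(softmax(s)[gt])
```

A correct prediction yields a lower loss than an incorrect one, which is the signal back-propagated to both the visual and textual branches.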
3.7 Pretraining with Weakly-Paired Data
For cross-modal matching, we encourage the video-text interacted representation of each sample to be closer to that of its paired description and far away from those of negative descriptions, which are randomly collected from other video-text pairs in each training iteration. This is formally achieved by minimizing the following contrastive objective:

$\mathcal{L}_{VTM} = -\log \frac{\exp\big(s(v, t^{+})\big)}{\exp\big(s(v, t^{+})\big) + \sum_{t^{-} \in \mathcal{N}} \exp\big(s(v, t^{-})\big)}, \quad (19)$

where $t^{+}$ is the paired description, $\mathcal{N}$ denotes the set of negative descriptions of a sample, and $s(\cdot,\cdot)$ is the cross-modal matching score. The parameters to be optimized are hidden in the process of calculating $s$ as introduced above. For negative sampling, we sample from the whole training set at each iteration. For masked language modelling, we only corrupt the positive description of each video, for efficiency.
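The matching objective is NCE-style: the paired description's score is contrasted against sampled negatives. A minimal sketch (our naming; the exact score function and any temperature are not specified here):

```python
import numpy as np

def matching_loss(pos, negs):
    """Contrastive matching loss: push the score of the paired
    description above the scores of randomly drawn negatives."""
    logits = np.concatenate([[pos], np.asarray(negs, dtype=float)])
    e = np.exp(logits - logits.max())   # stabilized softmax
    return -np.log(e[0] / e.sum())
```

Raising the positive score while the negative scores stay fixed strictly decreases the loss, which is the intended training pressure.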
4 Experiments

4.1 Datasets and Configuration
We conduct experiments on benchmarks whose QAs feature temporal dynamics: 1) NExT-QA is a manually annotated dataset that features causal and temporal object interactions in space-time. 2) TGIF-QA features short GIFs; it asks questions about repeated action recognition and temporal state transition, along with FrameQA, which invokes a single frame for answering. For broader comparison, we also experiment on MSRVTT-QA, which challenges holistic visual recognition or description. Other data statistics are presented in Appendix 0.A.
We decode each video into frames and then sparsely sample $l_v$ frames per video, distributed into $k$ clips of length $l_c$. For each frame, we detect and keep the regions of highest confidence for NExT-QA (Top-5 are used in the pretraining-free experiments; refer to our analysis in Appendix 0.C.2), and similarly for the other datasets, using a pretrained object detection model. The dimension of the models' hidden states is $d$. The default numbers of layers and self-attention heads in the transformers are $H$ and $e$ respectively (a smaller head count is used for the edge transformer in DGT), and the number of graph layers is $U$. For training, we use the Adam optimizer with an initial learning rate following a cosine annealing schedule. The batch size is set to 64, and the maximum epoch varies from 10 to 30 among datasets. Our pretraining data are collected from WebVid. More details are presented in Appendix 0.B.
4.2 State-of-the-Art Comparison
In Table 1, we compare VGT with prior arts on NExT-QA. The results show that VGT surpasses the previous SoTAs by clear margins on both the val and test sets, improving the overall accuracy by 1.6% and 1.9% respectively. VGT even outperforms the latest work ATP, which is based on CLIP features (VGT vs. ATP: 55.02% vs. 54.3%), and thus sets new SoTA results. In particular, we note that these strong results come without any large-scale cross-modal pretraining. When pretraining VGT with a (relatively) small amount of data, we can further increase the results to 56.9% and 55.7% on the NExT-QA val and test sets respectively (refer to our analysis of Table 5 in Sec. 4.4).
(Table 1: comparison with SoTA methods on NExT-QA; columns: Method, CM-Pretrain, NExT-QA Val, NExT-QA Test.)
Compared with VQA-T, which also formulates VideoQA as a problem of similarity comparison instead of classification, VGT outperforms it on almost all metrics. The strong results could be attributed to VGT explicitly modelling the object interactions and dynamics for visual reasoning, instead of holistically encoding video clips with S3D [39, 61]. For a better analysis, we further replace the S3D encoder in VQA-T with our DGT module. As shown in Table 2 (S3D→DGT), our DGT encoder significantly improves VQA-T's result by 4.7%, with most of the improvements coming from answering reasoning-type questions.

(Table 2: model comparison; columns: Models, Size (M), NExT-QA Val.)

Aside from the DGT module, we encode the candidate answers in the context of the corresponding question with a single language model, whereas VQA-T encodes Q and A independently with two language models. Our method improves answer encoding with contexts and reduces the model size (i.e., parameters), as shown in Table 2 (VGT (DistilBERT)). Finally, VQA-T adopts a cross-modal transformer to fuse the video-question pair, whereas we design a light-weight cross-modal interaction module. The module is more parameter-efficient yet has little impact on the performance (CMTrans→CM in Table 2).
Compared with other graph-based methods [9, 23, 60], VGT enjoys several advantages: 1) It explicitly models the temporal dynamics of both objects and their interactions. 2) It solves VideoQA by explicit similarity comparison between the video and text instead of classification. 3) It represents both visual and textual data with Transformers, which may improve feature compatibility and benefit cross-modal interaction and comparison. 4) VGT uses far fewer frames for training and inference (e.g., VGT vs. HQGA: 32 vs. 256), which benefits the efficiency of video encoding. Detailed analyses are given in Sec. 4.3.
In Table 3, we compare VGT with previous arts on the TGIF-QA and MSRVTT-QA datasets. The results show that VGT performs remarkably well on the tasks of repeated action recognition and state transition that feature temporal dynamics, surpassing the previous pretraining-free SoTA results significantly by 10.6% (VGT vs. MASN: 95.0% vs. 84.4%) and 6.8% (VGT vs. MHN: 97.6% vs. 90.8%) respectively. It even beats the pretraining SoTA (i.e., MERLOT) by about 1.0%, without using any external data for cross-modal pretraining. On TGIF-QA-R, which is curated by making the negative answers in TGIF-QA more challenging, we also observe remarkable improvements. Besides, VGT achieves competitive results on the normal descriptive QA tasks defined in FrameQA and MSRVTT-QA, though they are not our focus.
| Method | CM-Pretrain Data | Action | Trans. | FrameQA | Action (R) | Trans. (R) | MSRVTT-QA |
|---|---|---|---|---|---|---|---|
| SiaSRea | VG+COCO Caption | 79.7 | 85.3 | 60.2 | - | - | 41.6 |
| MERLOT | YouTube180M, CC3M | 94.0 | 96.2 | 69.5 | - | - | 43.1 |
4.3 Model Analysis
DGT. The middle block of Table 4 shows that removing the DGT module (w/o DGT), i.e., directly summarizing the object representations in each clip, leads to clear performance drops (2.0%) on all tasks that challenge spatio-temporal reasoning. We then study the temporal graph transformer module (w/o TTrans) by removing both NTrans and ETrans. It shows better results than removing the whole DGT module, yet its performance on tasks featuring temporal dynamics is still weak. We further ablate the temporal graph transformer to investigate the independent contributions of the node transformer (NTrans) and edge transformer (ETrans). The results (w/o NTrans and w/o ETrans) demonstrate that both transformers benefit temporal dynamic modelling. Finally, the ablation study on the global frame feature reveals its vital role in DGT.
Similarity Comparison vs. Classification. We study a model variant by concatenating the outputs of the DGT module with the token representations from BERT, in a way analogous to ClipBERT. The formed text-video representation sequence is fed to a cross-modal transformer for information fusion. Then, the output of the '[CLS]' token is fed to a multi-way classifier over the global answer set in open-ended QA, or to a classifier for binary relevance in multi-choice QA, following [20, 28, 60]. As can be seen from the bottom part of Table 4, this classification variant (Comp→CLS) leads to drastic performance drops. For completeness, we also conduct additional experiments on the FrameQA task, which is set as open-ended QA. Again, we find that the accuracy drops from 61.6% to 56.9%. A detailed analysis of the performance on the training and validation sets (see Appendix 0.C.1) reveals that the CLS variant suffers from serious over-fitting on the target datasets. The experiment demonstrates the superiority of solving QA by relevance comparison instead of answer classification.
Cross-modal Interaction. The results suggest that it is better to integrate textual information at both the frame- and clip-level outputs (CM-CF) for TGIF-QA, while our default interaction at the clip-level outputs (CM-C) brings the optimal results on NExT-QA. Compared with the baseline that does not use cross-modal interaction, all three kinds of interaction improve the performance. We notice that cross-modal interaction improves the accuracy on TGIF-QA by more than 10%. A possible reason is that GIFs are trimmed short videos that contain only the QA-related visual contents. This greatly eases the challenge of spatio-temporally grounding the positive answers, especially when most of the negative answers are not present in the short GIFs; thus, cross-modal interaction performs more effectively on this dataset. The videos in NExT-QA are not trimmed, so the improvements are relatively smaller. Based on these observations, we perform cross-modal interaction at both the frame- and clip-level outputs for the temporal reasoning tasks in TGIF-QA, and keep the default implementation for the other datasets.
4.4 Pretraining and Finetuning
Table 5 presents a comparison of VGT with and without pretraining. We can see that pretraining steadily boosts the QA performance, especially on NExT-QA. The relatively smaller improvements on TGIF-QA could be because the TGIF-QA dataset is large and has enough annotated data for fine-tuning, so pretraining helps little. Besides, we find that finetuning with masked language modelling (MLM) can improve generalization from the val to the test set, and thus achieves the best overall accuracy (i.e., 55.7%) on the NExT-QA test set. Fig. 6 studies the QA performance on the NExT-QA val set with respect to different amounts of pretraining data. Generally, there is a clear tendency of improvement in the overall accuracy (Acc@All) when more data is available. A more detailed analysis shows that these improvements mostly come from stronger performance in answering causal (Acc@C) and descriptive (Acc@D) questions. For temporal questions, pretraining with more data does not seem to help much. Therefore, to boost performance further, it is promising to add more data or to explore better ways of handling temporal language.
| Methods | TGIF-QA | NExT-QA Val (Acc@C / Acc@T / Acc@D / Acc@All) | NExT-QA Test (Acc@C / Acc@T / Acc@D / Acc@All) |
|---|---|---|---|
| VGT (FT w/ QA) | 60.2 / 71.0 | 53.93 / 56.20 / 70.14 / 57.19 | 51.73 / 53.78 / 67.05 / 54.88 |
| VGT (FT w/ QA & MLM) | 60.5 / 71.5 | 53.43 / 56.39 / 69.50 / 56.89 | 52.78 / 54.54 / 67.26 / 55.70 |
4.5 Qualitative Analysis
In Fig. 7, we qualitatively analyze the benefits of both the dynamic graph transformer and pretraining. The example in (a) shows that the model without the DGT module is prone to predicting atomic or contact actions (e.g., 'grab') that can be captured at the static frame level. (b) shows that the model without pretraining fails to predict answers that are highly abstract (e.g., 'adjust'). Finally, we show a failure case in (c): our model tends to predict distractor answers that are semantically close to the questions when the object of interest in the video is small and the detector fails to detect it. Keeping more detected regions could help, but one needs to carefully balance graph complexity and inference efficiency. Another alternative is to perform modulated detection; we leave this for future exploration.
We presented the Video Graph Transformer, which explicitly exploits the objects, their relations, and their dynamics to improve visual reasoning and alleviate the data-hungry issue in VideoQA. Our extensive experiments show that VGT achieves superior performance compared with previous SoTA methods on tasks that challenge temporal dynamic reasoning. Its performance even surpasses methods pretrained on large-scale vision-text data. To study the learning capacity of VGT, we further explored pretraining on weakly-paired video-text data and obtained promising results. With careful and comprehensive analyses of the model, we hope this work encourages more efforts in designing effective models that alleviate the burden of handling large-scale data, and also promotes VideoQA research that goes beyond holistic recognition/description to reason about fine-grained video details.
Acknowledgments
This research is supported by the Sea-NExT Joint Lab. The major work was done when Junbin was a research intern at Sea AI Lab. We greatly thank Angela Yao as well as the anonymous reviewers for their thoughtful comments toward a better work.
References
- Noise estimation using density estimation for self-supervised multimodal learning. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 35, pp. 6644–6652.
- Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077–6086.
- (2021) Frozen in time: a joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1728–1738.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
- (2021) Is space-time attention all you need for video understanding? In ICML, pp. 813–824.
- (2022) Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2917–2927.
- (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- (2020) UNITER: universal image-text representation learning. In European Conference on Computer Vision (ECCV), pp. 104–120.
- (2022) (2.5+1)D spatio-temporal scene graphs for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 36, pp. 444–453.
- (2021) Hierarchical object-oriented spatio-temporal reasoning for video question answering. In IJCAI.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- (2021) Attention over learned object embeddings enables complex visual reasoning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34.
- (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR).
- Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1999–2007.
- (2021) VIOLET: end-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681.
- (2018) Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6576–6585.
- (2021) Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In AAAI Conference on Artificial Intelligence (AAAI).
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2020) Location-aware graph convolutional networks for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 11021–11028.
- (2017) TGIF-QA: toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2758–2766.
- (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pp. 4904–4916.
- (2020) Divide and conquer: question-guided spatio-temporal contextual attention for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 11101–11108.
- (2020) Reasoning with heterogeneous graph alignment for video question answering. In AAAI Conference on Artificial Intelligence (AAAI).
- (2021) MDETR: modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1780–1790.
- (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
- (2018) Referring relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6867–6876.
- (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), pp. 32–73.
- (2020) Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9972–9981.
- (2021) Less is more: ClipBERT for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7331–7341.
- (2018) TVQA: localized, compositional video question answering. In Empirical Methods in Natural Language Processing (EMNLP).
- (2021) Align before fuse: vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34.
- (2020) HERO: hierarchical encoder for video+language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2046–2065.
- (2019) Beyond RNNs: positional self-attention with co-attention for video question answering. In AAAI Conference on Artificial Intelligence (AAAI), pp. 8658–8665.
- (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision (ECCV), pp. 121–137.
- (2022) Invariant grounding for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2928–2937.
- (2021) HAIR: hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1698–1707.
- (2022) Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3202–3211.
- (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 13–23.
- (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9879–9889.
- (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2630–2640.
- (2021) Bridge to answer: structure-aware graph interaction network for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15526–15535.
- (2021) Progressive graph attention network for video question answering. In ACM MM, pp. 2871–2879.
- (2022) Multilevel hierarchical network with multiscale sampling for video question answering. In IJCAI.
- (2021) Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 28.
- (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Advances in Neural Information Processing Systems (NeurIPS).
- (2021) Attend what you need: motion-appearance synergistic networks for video question answering. In ACL, pp. 6167–6177.
- (2021) Look before you speak: visually contextualized utterances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16877–16887.
- (2019) Annotating objects and relations in user-generated videos. In Proceedings of the 2019 International Conference on Multimedia Retrieval (ICMR), pp. 279–287.
- (2019) Relation understanding in videos: a grand challenge overview. In Proceedings of the 27th ACM International Conference on Multimedia (MM), pp. 2652–2656.
- (2018) Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565.
- (2020) VL-BERT: pre-training of generic visual-linguistic representations. In International Conference on Learning Representations (ICLR).
- (2019) VideoBERT: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7464–7473.
- (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, pp. 5100–5111.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.
- (2021) TCL: transformer-based dynamic graph modelling via contrastive learning. arXiv preprint arXiv:2105.07944.
- (2018) Videos as space-time region graphs. In European Conference on Computer Vision (ECCV), pp. 399–417.
- (2020) Visual relation grounding in videos. In European Conference on Computer Vision (ECCV), pp. 447–464.
- (2021) NExT-QA: next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9777–9786.
- (2022) Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2804–2812.
- (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In European Conference on Computer Vision (ECCV), pp. 305–321.
- (2017) Video question answering via gradually refined attention over appearance and motion. In ACM MM, pp. 1645–1653.
- (2021) VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In EMNLP, pp. 6787–6800.
- (2021) Just Ask: learning to answer questions from millions of narrated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1686–1697.
- (2019) CLEVRER: collision events for video representation and reasoning. In International Conference on Learning Representations (ICLR).
- (2021) Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34.
- (2021) Learning from inside: self-driven Siamese sampling and reasoning for video question answering. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34.
- (2018) A joint sequence fusion model for video question answering and retrieval. In European Conference on Computer Vision (ECCV), pp. 471–487.
- Graph transformer networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32.
- (2021) MERLOT: multimodal neural script knowledge models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34.
- (2022) Video question answering: datasets, algorithms and challenges. arXiv preprint arXiv:2203.01225.
- (2020) ActBERT: learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8746–8755.
- (2020) Rethinking pre-training and self-training. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 3833–3845.
Appendix 0.A Data Statistics
The statistical details of the experimented datasets are presented in Table 6. For better comparison with previous works, we focus on the multi-choice QA task of NExT-QA, though it also defines an open-ended QA task. For TGIF-QA, we also conduct experiments on the latest version, which generates more challenging negative answers for each question in the multi-choice tasks. In particular, we further fix a ‘redundant answer’ issue: we find that about 10% of the questions have redundant candidate answers, and some of the candidate answers are even identical to the correct one. The rectified annotations will be released along with the code.
| Datasets | Main Challenges | #Videos/#QAs | Train | Val | Test | VLen (s) | QA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NExT-QA | Causal & Temporal Interaction | 5.4K/48K | 3.8K/34K | 0.6K/5K | 1K/9K | 44 | MC |
| TGIF-QA | Repetition Action | 22.8K/22.7K | 20.5K/20.5K | - | 2.3K/2.3K | 3 | MC |
| MSRVTT-QA | Descriptive QA | 10K/244K | 6.5K/159K | 0.5K/12K | 3K/73K | 15 | OE |
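The ‘redundant answer’ fix described above amounts to scanning each multi-choice question for duplicated candidates. A minimal sketch of such a check (not the authors' released code; the data layout with `qid`/`answers` keys is an assumption for illustration):

```python
# Hypothetical data layout: each question is a dict with a 'qid' and a
# list of candidate answer strings under 'answers'.
def find_redundant(questions):
    """Return qids of questions whose candidate answers contain duplicates
    (after simple normalization), as in the redundancy check above."""
    flagged = []
    for q in questions:
        cands = [a.strip().lower() for a in q["answers"]]
        if len(set(cands)) < len(cands):  # any repeated candidate
            flagged.append(q["qid"])
    return flagged
```

Questions flagged this way can then have their duplicated candidates regenerated or replaced when rectifying the annotations.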
Appendix 0.B Implementation Details
For training with QA annotations, we first train the whole model (except the object detection model) end-to-end, and then freeze BERT to fine-tune the other parts of the best model obtained in the first stage. The best results of the two stages are reported as the final results. Note that our hyper-parameters are mostly searched on the NExT-QA validation set and kept unchanged for the other datasets. The maximum number of epochs varies from 10 to 30 across datasets. For pretraining with data crawled from the Web, we download about 80K video-text pairs (less than 5%) from WebVid2.5M. Frames are then extracted at 5 frames per second and processed in the same way as for QA. We then optimize the model with an initial learning rate of and batch size 64. The number of negative descriptions of a video for cross-modal matching is set to 63; they are randomly selected from the descriptions of other videos in the whole training set. Besides, each text token is corrupted with a probability of 15% in masked language modelling. Following , a corrupted token is replaced with 1) the ‘[MASK]’ token with a chance of 80%, 2) a random token with a chance of 10%, and 3) the same token with a chance of 10%. We train the model for a maximum of 2 epochs, which gives the best generalization results, and pretraining takes about 2 hours.
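The 15% / 80-10-10 corruption scheme above is the standard BERT-style masking. A minimal sketch (function and variable names are illustrative, not from the released code):

```python
import random

def corrupt_tokens(tokens, vocab, p_select=0.15, rng=None):
    """BERT-style MLM corruption: select each token with prob. p_select;
    a selected token becomes '[MASK]' 80% of the time, a random vocab
    token 10% of the time, and stays unchanged 10% of the time.
    Returns (corrupted tokens, per-position labels; None = not in loss)."""
    rng = rng or random.Random(0)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < p_select:
            labels.append(tok)              # model must predict the original
            r = rng.random()
            if r < 0.8:
                out.append("[MASK]")        # 80%: mask
            elif r < 0.9:
                out.append(rng.choice(vocab))  # 10%: random token
            else:
                out.append(tok)             # 10%: keep, still predicted
        else:
            labels.append(None)             # unselected: no MLM loss
            out.append(tok)
    return out, labels
```

In practice this operates on token ids rather than strings, but the selection logic is the same.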
Appendix 0.C Additional Model Analysis
0.C.1 Similarity Comparison vs. Classification
To study the reason for the poor performance of the classification model variant described in Sec. 4.3 of the main text, we visualize the training and validation accuracy over training epochs in Fig. 8. The results indicate that the classification variant suffers from serious over-fitting, especially on NExT-QA, whose QA contents are relatively complex but which has less training data. To study whether the problem comes from the classification formulation or from the cross-modal transformer, we further substitute the cross-modal transformer (CM-Trans) with our cross-modal interaction (CM) module introduced in Sec. 3.4 of the main text. We find that this substitution slightly alleviates the problem: for example, on the NExT-QA val set, accuracy increases from 45.82% to 46.98%. Nevertheless, the performance is still much worse than the comparison-based implementation (i.e., 55.02%). This experiment reveals two facts: 1) formulating the QA problem as classification is the major cause of the weak performance; 2) the cross-modal transformer exacerbates the over-fitting problem, possibly because it introduces additional parameters.
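The comparison-based formulation discussed above scores each candidate answer's embedding against the fused video-question representation and takes the argmax, instead of classifying over a fixed answer vocabulary (which discards answer word semantics). A toy sketch of the scoring step, with made-up embeddings and a plain dot product standing in for the model's similarity:

```python
def dot(u, v):
    # plain dot product as a stand-in for the learned similarity
    return sum(a * b for a, b in zip(u, v))

def answer_by_similarity(vq_emb, cand_embs):
    """Pick the candidate whose embedding best matches the fused
    video-question embedding (comparison formulation). A classification
    head would instead ignore cand_embs and output a class index."""
    scores = [dot(vq_emb, c) for c in cand_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the answer embeddings come from a text encoder, semantically related answers get related scores, which the class-index formulation cannot exploit.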
0.C.2 Study of Video Sampling
In Fig. 9, we study the effect of the number of sampled video clips and region proposals on the NExT-QA test set. Regarding sampled video clips, we find that 8 clips consistently outperform 4 clips. This is understandable, as the videos in NExT-QA are relatively long. As for sampled regions, when learning the model from scratch, 5 regions per frame gives relatively better results, e.g., 53.68%. Nonetheless, when pretraining is included, 20 regions gives better results, e.g., 55.70%. This difference could be because learning with more regions causes over-fitting when the dataset is not large enough, since the constructed graphs become much larger and more complex. Our speculation is also supported by the fact that accuracy increases with the number of sampled regions when we sample only 4 video clips, and hence have fewer total graph nodes.
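The sampling scheme studied above can be sketched as follows: uniformly place a fixed number of clips over the video, and keep only the top-k most confident detections per frame. This is a hedged illustration, not the released pipeline; the clip length and the (score, box) detection layout are assumptions:

```python
def sample_clips(num_frames, n_clips, frames_per_clip=4):
    """Uniformly place n_clips windows of frames_per_clip frames each
    over a video of num_frames frames (assumes num_frames >= frames_per_clip)."""
    stride = max(1, (num_frames - frames_per_clip) // max(1, n_clips - 1))
    starts = [min(i * stride, num_frames - frames_per_clip)
              for i in range(n_clips)]
    return [list(range(s, s + frames_per_clip)) for s in starts]

def top_k_regions(detections, k):
    """Keep the k most confident detections; each detection is a
    (score, box) tuple from the object detector."""
    return sorted(detections, key=lambda d: -d[0])[:k]
```

Varying `n_clips` (4 vs. 8) and `k` (5 vs. 20) directly controls the number of graph nodes, which is the knob analyzed in Fig. 9.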
0.C.3 Model Efficiency
| Models | Acc@All | #Params (M) | GPU Memory | Time |
| --- | --- | --- | --- | --- |
We compare VGT with VQA-T in Tab. 7 for a better understanding of memory and time costs. Experiments are done on 1 Tesla V100 GPU with batch size 64; inference FLOPs are reported for 1 example. Memory: VGT has fewer trainable parameters (133.7M vs. 156.5M) and thus a smaller model size than VQA-T (511M vs. 600M). The BERT encoder in VGT accounts for 82% of the parameters; the vision part is lightweight, with only 24M parameters. VGT needs more GPU memory for training, yet its inference memory is fairly small and close to that of VQA-T. We also implement a smaller version of VGT by replacing BERT with DistilBERT, as in VQA-T. With nearly 0.6× the number of VQA-T's parameters (90.5M/156.5M), we still achieve strong performance (i.e., 53.46%). Time: Our FLOPs on 1 example are 2.9× those of VQA-T, and 1.6× if we use DistilBERT. However, VGT converges much faster and needs far fewer epochs (and thus total FLOPs) to obtain results superior to VQA-T when training with the same data. For example, on NExT-QA, VGT's result at epoch 2 (50.16%) already significantly surpasses VQA-T's best result (45.30%), achieved at epoch 8. Also, VGT's result without pretraining surpasses that of VQA-T pretrained with million-scale data. In this sense, VGT needs much fewer total FLOPs than VQA-T and other similar pretrained models for visual reasoning.
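Parameter counts like those above are typically obtained by summing element counts of all trainable tensors, optionally grouped by module prefix (e.g., the BERT text encoder vs. the vision part). A sketch written against plain dicts so it runs without torch; with PyTorch one would iterate `model.named_parameters()` the same way:

```python
def count_params(named_shapes, prefix=""):
    """Sum parameter counts over tensors whose name starts with prefix.
    named_shapes: dict mapping parameter name -> shape tuple,
    mirroring PyTorch's named_parameters() (names are illustrative)."""
    total = 0
    for name, shape in named_shapes.items():
        if name.startswith(prefix):
            n = 1
            for d in shape:
                n *= d  # elements in this tensor
            total += n
    return total
```

Calling it with `prefix="bert"` vs. no prefix is how one would verify a split such as "BERT accounts for 82% of the parameters".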