
Dense but Efficient VideoQA for Intricate Compositional Reasoning

by Jihyeon Lee et al.
Kakao Corp.
Hanyang University

It is well known that most conventional video question answering (VideoQA) datasets consist of easy questions requiring simple reasoning processes. However, long videos inevitably contain complex and compositional semantic structures along the spatio-temporal axis, which require a model to understand the compositional structures inherent in the videos. In this paper, we suggest a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. Deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map so as to efficiently cover a temporally long range of frames. Furthermore, the dependency structure within the complex question sentences is combined with the language embeddings to readily capture the relations among question words. Extensive experiments and ablation studies show that the suggested dense but efficient model outperforms other baselines.





1 Introduction

Along with the immense success of deep learning methods for understanding the contents of images and text, various applications requiring complex reasoning have been proposed. In particular, visual question answering (VQA) [Antol_2015_ICCV] is one of the most important such tasks; it asks a diverse set of questions about visual contents and requires understanding the semantic structures inherent in those contents. By virtue of the emergence of transformer architectures and their pre-training schemes, VQA performance has improved substantially [lu2019vilbert, tan2019lxmert]; however, it is not straightforward to apply these architectures to the video domain. Compared to images and text, video data involves more complex semantic structures along not only the spatial but also the temporal axis. As described in Figure 1, long videos inevitably contain multiple events, and the events can have multiple and complex correlations. Therefore, it is important to temporally ground the multiple events and their semantic structure.

* indicates equal contribution

Figure 1: Example of an intricate VideoQA problem. The semantic elements of the video, which consist of characters, their actions, and the relationships between the characters, are continually changing along the temporal axis. Therefore, it is hard to answer questions that require understanding this complex semantic structure.

Most of the previous datasets proposed for VideoQA consist of relatively short clips containing a single event or action class, paired with relatively easy questions [jang-IJCV-2019, mario, lei2019tvqa, xu2016msr-vtt]. For this reason, such short clips can be sufficiently addressed with image-based architectures by selecting a few representative frames from the clips. However, in the case of long videos with various events and complex relationships between those events, conventional architectures struggle to learn over large timescales. In these cases, it is essential to address the temporal grounding of the various events by considering enough frames within the videos.

In this paper, we suggest a novel video/text understanding method for intricate VideoQA tasks, which consist of complicated questions requiring multiple reasoning steps. The two main ideas of the suggested method are 1) efficiently sampling as many informative visual features as possible from the videos to learn the inherent temporal semantic structures, and 2) considering a hierarchical dependency model to understand complex questions requiring multiple reasoning steps.

First of all, we suggest a deformable sampling module, which allows dense but efficient visual token sampling. The conventional sparse sampling method [lei2021less], which selects a few frames followed by temporal pooling to obtain a single feature vector for downstream tasks, leads to an incomplete understanding of long and intricate videos. As can be seen in Figure 1, at least three intervals of the video must be considered to answer the question correctly. Unfortunately, there is a fundamental trade-off between computational cost and the number of frames the model can process. To resolve this problem, we introduce a deformable attention module that effectively selects a subset of meaningful visual features along the spatio-temporal axis. Specifically, the suggested method considers the semantics of the given query sentence.

Secondly, we introduce a dependency attention module to learn dependency-aware feature vectors of question tokens. As input videos contain more complex semantic structures, complicated questions requiring multiple reasoning steps are inevitable. Therefore, it is necessary to take into account the semantic structure within the questions to learn desirable embeddings of the question tokens. We suggest leveraging the semantic structure from the dependency parse tree of the questions. By combining the deformable sampling module and the dependency attention module, our method can handle intricate compositional reasoning problems.

In experiments, we evaluate our model on the Action Genome Question Answering (AGQA) [GrundeMcLaughlin2021AGQA] dataset. The AGQA dataset is one of the most challenging benchmarks for VideoQA because it requires complex reasoning steps over long videos. Extensive experiments not only show impressive quantitative results on QA accuracy but also verify the effectiveness of each module through a comprehensive ablation study.

In summary, our contributions are as follows:

  • We empirically reveal that covering a long time span is advantageous for complex problems that need spatio-temporal reasoning.

  • We introduce a deformable sampling-based VideoQA model, DSR, which aims to solve compositional reasoning problems.

  • Our experiments on VideoQA benchmarks show that the proposed method has the ability to perform complex spatio-temporal reasoning.

2 Related Work

Visual Question Answering

VQA is the task of understanding how two inputs, text-based questions and visual features, relate to one another, first proposed by Antol et al. [Antol_2015_ICCV]. For image-based question answering, a significant body of work proposes attention-based architectures to fuse question and image representations [Anderson2017up-down, zhou2019vlp, Kim2018, gao2019dynamic]. Kim et al. [Kim2018] show remarkable performance with a bilinear attention network that finds bilinear interactions between the two modalities. Moreover, inspired by the recent success of pre-trained language models [devlin-etal-2019-bert, clark2020electra], universal pre-training frameworks for vision-language representation learning achieve state-of-the-art performance not only on VQA but also on general vision-language tasks [tan2019lxmert, lu2019vilbert].

However, question answering in the video domain is under-explored compared to the image domain. Contrary to the growing interest in measuring video reasoning capabilities [jang-IJCV-2019, GrundeMcLaughlin2021AGQA, lei2018tvqa, lei2019tvqa, xu2017video], existing VideoQA models mostly deal with short clips or simple questions [le2020hierarchical, fan-CVPR-2019, seo-etal-2021-attend, lei2021less]. Since a video is a sequence of images with a temporal dimension, understanding richer spatio-temporal features and the temporal localization of natural language is essential. To fuse temporal features, Fan et al. and Seo et al. utilize separate motion and appearance feature modules and integrate them with an additional fusion network [le2020hierarchical, fan-CVPR-2019, seo-etal-2021-attend]. Le et al. [le2020hierarchical] propose a hierarchical conditional relation network to embed the video input at different granularities. However, separated modules have limitations in effectively interacting with linguistic questions, and their performance fell behind as transformer-based models rose. Kim et al. [self-videoqa] propose a contrastive-learning-based training scheme that shows competitive performance, but it specializes only in multiple-choice tasks. The current state-of-the-art model in VideoQA is ClipBERT [lei2021less], which is based on a cross-modal transformer. ClipBERT enables end-to-end learning by employing sparse sampling, but it is unsuitable for intricate tasks that require advanced spatio-temporal reasoning, since random sparse sampling discards parts of the semantic structure. We propose a dense but efficient transformer-based VideoQA model that can maintain the whole semantic structure.

Efficient Transformers

The Transformer architecture [NIPS2017_3f5ee243] has shown remarkable performance on various downstream tasks. However, the computational cost and memory consumption of the Transformer grow quadratically with the input sequence length. There has recently been a surge of research interest in efficient transformer architectures that mitigate this problem. For example, various algorithms that approximate the quadratic-cost attention matrix via low-rank matrix factorization have been proposed in natural language processing [wang2020linformer, choromanski2021rethinking, xiong2021nystromformer]. In the vision domain, the scope of self-attention is restricted to local neighborhoods or specific axes based on the locality of objects [ho2019axial, child2019generating, gberta_2021_ICML, Arnab_2021_ICCV]. However, these algorithms are defined for a single modality. In contrast, we address cross-modal sparsification with a question-conditional visual token sampling algorithm for the VideoQA task, where non-local and fine-grained features are required.

3 DSR: Deformable Sampling-based VideoQA model for Compositional Reasoning

In this section, we introduce a detailed explanation of our model. We consider the intricate VideoQA problems, which require compositional spatio-temporal reasoning. Our goal is to learn a generalizable visual-reasoning representation with deformable sampling and dependency modeling.

Figure 2: The overall architecture of DSR. Two details are omitted for simplicity: deformable sampling is conditioned on the question context embedding, and global visual features are used as additional input to the cross-modal transformer.

3.1 Transformer-based dense sampling model

We propose the Deformable Sampling-based VideoQA model for compositional Reasoning (DSR), a dense but efficient model that utilizes deformable sampling for video features and dependency modeling for text questions. Figure 2 gives the overall architecture of DSR, which is based on a cross-modal transformer. Each visual feature and question token is independently encoded with a vision backbone model and a language encoder, respectively. The inputs of the cross-modal transformer are conditionally sampled video features and dependency-guided question tokens. We denote the visual and language inputs of the transformer as V and Q, respectively, where N_v is the number of visual tokens sampled by the conditional sampling module, N_q is the number of question tokens, and d is the representation dimension, so that V is N_v × d and Q is N_q × d. These embeddings of the two modalities are concatenated as input to a 12-layer transformer for cross-modal fusion, together with the special tokens [CLS] and [SEP].

We first uniformly sample frames from a video, densely enough to cover its full length. However, as the length of the video increases, using all of the dense frames becomes impossible, since they cannot fit into a single transformer due to memory limitations. Thus, motivated by Zhu et al. [zhu2020deformable], we introduce a deformable sampling module that samples only the necessary visual features from the full set of dense frames, conditioned on the question embeddings. Consequently, relatively few visual features are sampled from the module compared to the initial dense features. A detailed explanation of conditional sampling is given in Section 3.2. The language inputs (i.e., question tokens) also go through a pre-stage modeling step to enable compositional reasoning. The dependency attention module forces a specific attention head of the transformer to follow the dependency parsing structure, which represents the relationships between words in a question sequence. It is explained in Section 3.3.
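The uniform pre-sampling step can be sketched as follows; `uniform_frame_indices` is a hypothetical helper for illustration, not from the authors' released code.

```python
# Illustrative sketch of dense uniform frame sampling (hypothetical helper).
def uniform_frame_indices(num_video_frames, num_samples):
    """Pick `num_samples` frame indices evenly spanning the whole video."""
    if num_samples >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / num_samples
    # Take the centre frame of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The deformable sampling module then reduces the features of these uniformly sampled frames to a much smaller conditional subset.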

The output vector of the [CLS] token, h_[CLS], is an aggregated representation of the entire input sequence of the cross-modal transformer and is used to predict the answer. We treat all QAs as open-ended tasks that choose one correct word as the answer from a pre-defined answer set of size C. We compute a classification score by applying a linear classifier and a softmax function to the final output and train the model by minimizing the cross-entropy loss,

L = −Σ_{c=1}^{C} y_c log p_c,

in which p = softmax(W h_[CLS] + b) is the predicted answer distribution and y is the one-hot ground-truth answer label. During inference, conditionally sampled visual features and dependency-modeled linguistic features are utilized to predict answers with proper reasoning, in the same manner as in the training phase. In summary, our model achieves state-of-the-art performance on intricate VideoQA tasks by allowing end-to-end learning while covering temporally long and spatially fine-grained visual features, both of which are important for advanced modeling. Unlike models that observe only a single or a few video clips, DSR can tackle data that requires compositional reasoning.
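The answer head can be sketched as below; the weight names W, b and the function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Minimal sketch of the answer head: a linear classifier over the [CLS]
# vector followed by softmax and cross-entropy for one QA pair.
def answer_loss(cls_vec, W, b, gold_idx):
    logits = cls_vec @ W + b              # (C,) scores over the answer set
    logits = logits - logits.max()        # stabilise the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[gold_idx])       # cross-entropy against the gold answer
```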

3.2 Conditional Visual Feature Sampling

In this section, we describe how to effectively sample a subset of visual tokens from the long and dense feature map. Since video data has an additional temporal axis compared to image data, the feature map of a video clip is much bigger than that of an image. Thus, most VideoQA algorithms pool the feature map spatially [li2020hero] or temporally [lei2021less] and concatenate the sequence of visual features to the question word vectors; the concatenated feature is then used as input to a transformer-based QA model. However, this pooling-based approach is sub-optimal for compositional VideoQA tasks that require long-range and fine-grained visual cues. Here, we assume that most visual features in spatio-temporal feature maps are redundant and uninformative for answering a given question. Below, we describe how to sample a few informative visual features from the dense feature map.

Figure 3: Illustration of the proposed deformable sampling module. The figure represents a single head of a single CDA layer. For simplicity, we only visualize the deformable attention procedure of one reference point, which is colored blue.

Conditional Deformable Attention

Let F ∈ R^{d×T×H×W} be a dense visual feature map extracted by a visual encoder such as ResNet [he2016deep], where d, T, H, and W indicate the feature dimension, temporal length, height, and width of the feature map, respectively. Based on the 2-d deformable attention module [zhu2020deformable], we define our 3-d Conditional Deformable Attention (CDA), which samples question-conditional visual features from the spatio-temporal feature map given a question, as follows:

CDA(z_q, p_q, F) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_{mqk} · W'_m F(p_q + Δp_{mqk}) ],

in which q is the element index of an input query vector z_q of a transformer layer and p_q is its 3-d reference point. In the first transformer layer, the input query is a learnable query z_q, q = 1, ..., N, where N is the number of queries, which equals the number of sampled visual features. Also, before feeding the first transformer layer, we build a pooled question context c and add it to each learnable query by broadcast vector addition, z_q ← z_q + c, so that CDA samples visual features conditioned on the given question context. For the rest of the layers, z_q is the output vector of the previous transformer layer. M and K denote the total number of attention heads and sampled key vectors, respectively, and W_m, W'_m, and the projections predicting A_{mqk} and Δp_{mqk} are learnable linear layers. A_{mqk} denotes the attention weight of the k-th sampling point in the m-th attention head for query q, where Σ_k A_{mqk} = 1, and Δp_{mqk} is a 3-d sampling offset. Since p_q + Δp_{mqk} is a real-valued vector, we apply trilinear interpolation to compute F(p_q + Δp_{mqk}). With CDA, we obtain N sampled visual tokens, where N is much smaller than T × H × W, e.g., 25 vs. 30 × 7 × 7. An overview of CDA is illustrated in Figure 3.
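The trilinear-interpolation lookup that CDA needs for its fractional 3-d sampling points can be sketched as follows (an illustrative NumPy version, not the authors' implementation; out-of-range neighbours are assumed to contribute zero).

```python
import numpy as np

def trilinear_sample(feat, t, y, x):
    """Trilinearly interpolate a (d, T, H, W) feature map at the
    real-valued spatio-temporal point (t, y, x)."""
    d, T, H, W = feat.shape
    t0, y0, x0 = int(np.floor(t)), int(np.floor(y)), int(np.floor(x))
    out = np.zeros(d)
    # Accumulate the 8 surrounding grid corners, weighted by proximity.
    for dt in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                ti, yi, xi = t0 + dt, y0 + dy, x0 + dx
                if not (0 <= ti < T and 0 <= yi < H and 0 <= xi < W):
                    continue  # outside the map: contributes zero
                w = (1 - abs(t - ti)) * (1 - abs(y - yi)) * (1 - abs(x - xi))
                out += w * feat[:, ti, yi, xi]
    return out
```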

Regularization for Sampling Diversity

For the question answering task, the sampled visual tokens from CDA are concatenated with the question words, and a transformer-based model takes the concatenated features to predict an answer. Thus, it is important that the sampled visual tokens be as diverse as possible to provide sufficient information for a given question. In Deformable DETR [zhu2020deformable], offset predictions can remain diverse without collapse because each object query is trained to match a target object through the Hungarian loss. In the QA task, however, proper regularization is crucial to prevent collapse, because the model receives gradient feedback only from the answering loss. To reinforce the diversity of the sampled visual tokens, we explore three types of additional regularization terms. Here, we consider batched features, where the sampled tokens S_b and the feature map F_b have shapes N × d and d × T × H × W, respectively. The first regularization term is Soft Orthogonality (SO) [xie2017all], which is defined as follows:

L_SO = Σ_b ‖ S_b S_b^T − I ‖_F^2,

where b indicates the index in a mini-batch. The SO term pushes the Gram matrix of the sampled tokens, S_b S_b^T, toward the identity matrix I, so each sampled visual token can be distinctive and independent.
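A minimal sketch of the SO penalty for one batch element, under the shapes described above:

```python
import numpy as np

def soft_orthogonality(S):
    """Soft Orthogonality penalty for sampled tokens S of shape (N, d):
    squared Frobenius distance between the Gram matrix and the identity."""
    gram = S @ S.T                    # (N, N) token-token inner products
    eye = np.eye(S.shape[0])
    return np.linalg.norm(gram - eye, ord="fro") ** 2
```

The penalty is zero exactly when the token rows are orthonormal, so minimizing it encourages distinct, independent tokens.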

The second regularization term is the Maximal Coding Rate (MCR) [yu2020learning], which is formulated as follows:

L_MCR = −Σ_b (1/2) log det( I + (d / (N ε^2)) S_b S_b^T ),

where ε is an allowed distortion level. Minimizing L_MCR (i.e., maximizing the coding rate) results in the largest possible volume spanned by the sampled tokens, so the tokens become as independent as possible.
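The coding rate of one batch element can be sketched as below; the scaling constant follows Yu et al.'s definition and the default ε is an assumption.

```python
import numpy as np

def coding_rate(S, eps=0.5):
    """Coding rate of sampled tokens S (N, d); larger means the tokens
    span a larger volume. Assumed distortion level eps."""
    N, d = S.shape
    gram = np.eye(N) + (d / (N * eps ** 2)) * (S @ S.T)
    sign, logdet = np.linalg.slogdet(gram)  # stable log-determinant
    return 0.5 * logdet
```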

The last regularization term that we explore is the contrastive loss [chen2020simple]. Here, we set the anchor, positive, and negative examples as the sampled visual features S_b, the corresponding feature map F_b, and the feature maps of the other videos in the batch, respectively:

L_CL = −Σ_b log ( exp(sim(s_b, f_b)/τ) / Σ_{b'} exp(sim(s_b, f_{b'})/τ) ),

where s_b and f_b are the global average-pooled vectors of S_b and F_b, respectively. We use the cosine similarity as the similarity function sim(·, ·), and the temperature τ is set to 0.1 by default.
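An InfoNCE-style sketch of this term, assuming batch-level negatives and the stated cosine similarity and temperature:

```python
import numpy as np

def contrastive_loss(s, f, tau=0.1):
    """s, f: (B, d) pooled sampled-token and feature-map vectors; each
    s[i] should match its own f[i] against the other maps in the batch."""
    s = s / np.linalg.norm(s, axis=1, keepdims=True)   # cosine similarity
    f = f / np.linalg.norm(f, axis=1, keepdims=True)   # via unit vectors
    sim = (s @ f.T) / tau                              # (B, B) logits
    sim = sim - sim.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # diagonal = positives
```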

Global Context Features

The sampled visual features from our CDA represent the fine-grained local information required to answer the given question. However, the interaction between local and global information is also crucial for solving the spatio-temporally complex QA task more accurately. Thus, we introduce additional global information extracted by applying spatial pooling to the feature map F. As a result, we obtain the global-local visual feature by concatenating the global features and the sampled local visual features.
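The concatenation above can be sketched as follows; pooling each frame into one global token is an assumption (it is consistent with the 32 + 25 token count reported in the experiments), and the helper name is illustrative.

```python
import numpy as np

def global_local_features(feat, sampled):
    """feat: (d, T, H, W) dense map; sampled: (N, d) CDA tokens.
    Spatially pool each frame into one global vector, then concatenate
    the global tokens with the sampled local tokens."""
    d, T, H, W = feat.shape
    global_tokens = feat.reshape(d, T, H * W).mean(axis=2).T  # (T, d)
    return np.concatenate([global_tokens, sampled], axis=0)   # (T + N, d)
```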

Figure 4: Example of dependency structure.
Figure 5: Adjacency matrix generated from dependency relations.

3.3 Dependency Attention Module

In this section, we explain the details of the dependency attention module, which extracts dependency-aware vectors from the question tokens. Motivated by Deguchi et al. [deguchi-etal-2019-dependency], we introduce a self-attention module that incorporates dependency relations. Previous studies show that neural machine translation can be improved by incorporating sentence structure [chen-etal-2017-neural, eriguchi-etal-2017-learning, wu]. While most visual-language learning tasks rely only on a pre-trained language model to encode question embeddings, we believe that comprehension of sentence structure is crucial for nonconventional questions, and that a dependency-based attention module would also benefit the VideoQA task.

Language features are first processed by the dependency attention module before being fed into the cross-modal transformer. The module consists of an L-layer transformer, in which one attention head of a designated multi-head self-attention layer is trained with constraints based on dependency parsing values. Let H be the output of the previous layer. The dependency attention module first maps H to d_k-dimensional subspaces of multi-head attention as

Q = H W^Q,  K = H W^K,  V = H W^V,

where W^Q, W^K, and W^V are d × d_k parameter matrices. The dependency attention weight matrix D is calculated by the bi-affine operation [DBLP:conf/iclr/DozatM17] as follows,

D = softmax( Q U K^T ),

where U ∈ R^{d_k×d_k}. Each value D_{ij} represents the dependency relationship between two words, and the probability of token j being the governor of token i is modeled as D_{ij}. Then, as in the original self-attention module, the attention output is obtained by multiplying D and V. Finally, the attention outputs of all heads (i.e., one dependency output and the conventional outputs) are concatenated, and the rest is computed as in conventional multi-head attention.
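A concrete sketch of such a bi-affine dependency head (the notation and the simplified scorer are assumptions in the spirit of Dozat and Manning's biaffine parser, not the authors' exact code):

```python
import numpy as np

def dependency_attention_weights(Q, K, U):
    """Q, K: (n, dk) query/key projections; U: (dk, dk) biaffine matrix.
    Returns D, where D[i, j] models the probability that token j governs
    token i (row-wise softmax over candidate governors)."""
    scores = Q @ U @ K.T                          # (n, n) biaffine scores
    scores = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)
```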


Although D can be learned with an additional dependency loss function as in Deguchi et al. [deguchi-etal-2019-dependency], we explicitly force the correct dependency values not only in the training phase but also at inference. The gold parse provides an upper bound for using dependency relations and enables accurate structural modeling. Figure 4 shows an example of dependency relationships, and Figure 5 represents these relations as an adjacency matrix that is utilized as the gold value of D. The gold value forces each token to attend only to its governor.

To apply dependency relations to the transformer module, we reorganize the adjacency matrix for subword sequences. When a word is separated into multiple subwords by BPE [sennrich-etal-2016-neural], the governor (i.e., the head) of the rightmost subword is set to the governor of the original word and the governor of each subword other than the rightmost one is set to the right adjacent subword.
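The subword remapping rule above can be sketched as follows; `remap_governors` and its input format (word-level head indices plus per-word subword spans) are hypothetical illustrations of the described procedure.

```python
def remap_governors(word_governors, subword_spans):
    """word_governors[i]: word-level head of word i.
    subword_spans[i]: (start, end) subword range of word i, end exclusive.
    Returns subword-level head indices following the rule: the rightmost
    subword inherits the word's governor (mapped to that governor word's
    rightmost subword); every other subword points to its right neighbour."""
    rightmost = {i: span[1] - 1 for i, span in enumerate(subword_spans)}
    n_sub = subword_spans[-1][1]
    heads = [0] * n_sub
    for i, (start, end) in enumerate(subword_spans):
        for s in range(start, end - 1):
            heads[s] = s + 1                           # non-rightmost subword
        heads[end - 1] = rightmost[word_governors[i]]  # rightmost subword
    return heads
```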

4 Experiment

In this section, we evaluate our proposed model on compositional spatio-temporal reasoning datasets. We first introduce the details of benchmark datasets in Section 4.1. Section 4.2 describes experimental setup including implementation details. We also provide extensive quantitative experiments and ablation studies in Section 4.3, to show how each of the proposed modules works. Lastly, we qualitatively confirm that our model samples reasonable visual frames conditioned on given questions, in supplementary materials.

4.1 Dataset

Action Genome Question Answering

We validate DSR on the AGQA dataset proposed by Grunde-McLaughlin et al. [GrundeMcLaughlin2021AGQA], one of the most challenging benchmarks for VideoQA. While most existing benchmarks utilize only short video clips, use simple and biased questions, or focus on questions that require commonsense or external knowledge, AGQA consists of long video clips with an average length of 30 seconds. Each question, generated by a handcrafted program, necessarily requires spatio-temporal reasoning steps.

We adopt the balanced, novel composition, and more composition versions of AGQA [GrundeMcLaughlin2021AGQA]. The balanced dataset, 3.9M QA pairs associated with 9.6K videos, minimizes bias by balancing the answer distributions and the types of question structures. Novel composition is constructed to test whether models can disentangle distinct concepts and combine them well. For example, compositions like “before standing up” are removed from the training set while the individual words “before” and “standing” still appear; the split then tests how well the model performs on questions with those novel compositions at inference. More composition tests whether models generalize to more compositional steps: the training set contains only questions with fewer compositional steps, while the test set contains only questions with more reasoning steps. A model that generalizes to novel compositions and more compositional steps can be regarded as a successful VideoQA model that understands compositional semantic structures.

Open-answer questions have many possible answers, while binary questions have yes/no or before/after answers. Except for the first table, all tables adopt a 10% version of the balanced dataset for both the training and inference phases. We provide details of the datasets we used in the supplementary materials.


TVQA

TVQA [lei2018tvqa] is an intricate multiple-choice VideoQA dataset composed of 60-90 second video clips. Although the dataset provides video clips, questions, answers, subtitles, timestamps, and objects, we utilize only the video clips and QA pairs to verify intricate compositional reasoning ability. Most baselines use subtitles and propose models that maximize subtitle knowledge, since the performance gained from subtitles is much larger than that gained from video; we, however, use only videos, questions, and answers in order to demonstrate video understanding ability.

Benchmarks with short videos

MSRVTT-QA [xu2016msr-vtt] is built on videos in MSRVTT, with questions automatically generated from video descriptions. It consists of 10K videos and 243K QA pairs, with an average video length of 15 seconds. TGIF-QA [jang2017tgif] is a web GIF VideoQA dataset containing 165K QA pairs on 72K GIF videos with an average length of 3 seconds. MSRVTT and TGIF videos are not only short but also simple: they require only simple spatial reasoning, while AGQA requires intricate spatio-temporal reasoning. According to the original paper, MSRVTT is a set of simple clips, each of which can be described with a single sentence, and it is confined to a single domain. Lei et al. [lei2021less] support this by showing that adding more clips does not improve performance on either dataset and can even hurt it. Our model does not stand out on such simple tasks, since it aims to solve intricate reasoning problems by modeling dense features, but it still shows competitive results on these datasets.

Types                          PSAC [li2019beyond]  HME [fan-CVPR-2019]  HCRN [le2020hierarchical]  ClipBERT [lei2021less]  DSR (Ours)
Full Balanced       Binary     54.19                59.77                58.11                      63.83                   65.92 (+2.09)
                    Open       27.20                36.23                37.18                      48.54                   49.54 (+1.00)
                    All        40.40                47.74                47.42                      53.03                   54.36 (+1.33)
Novel Composition   Binary     43.00                52.39                43.40                      53.87                   59.57 (+5.70)
                    Open       14.80                19.46                23.72                      36.45                   38.73 (+2.28)
                    All        32.49                40.11                36.06                      40.82                   43.96 (+3.14)
More Composition    Binary     35.39                48.09                42.46                      42.93                   47.79 (-0.30)
                    Open       28.00                33.47                34.81                      45.93                   48.08 (+2.15)
                    All        31.13                39.70                38.00                      45.32                   48.02 (+2.70)
Table 1: Quantitative comparison with the baselines on the AGQA dataset. Full balanced, novel composition, and more composition represent different subsets of AGQA, as described in Section 4.1. Bold represents the best score.

4.2 Experimental Setup


Baselines

We compare our approach against four recent VideoQA methods [li2019beyond, fan-CVPR-2019, le2020hierarchical, lei2021less]. PSAC [li2019beyond] utilizes a co-attention block after unimodal self-attention blocks to simultaneously attend to both modalities. HME [fan-CVPR-2019] models question, appearance, and motion features with different LSTM encoders; additional visual and question memories help the multimodal fusion. HCRN [le2020hierarchical] designs conditional relation networks and stacks them to accommodate diverse input modalities and conditioning features. ClipBERT [lei2021less] feeds a few short clips independently into a cross-modal transformer and aggregates the prediction scores from the clips into a final score. For PSAC, HME, and HCRN, we use the performances reported in Grunde-McLaughlin et al. [GrundeMcLaughlin2021AGQA].

Implementation Details

2D ResNet-50 [resnet] and the word embedding layers of the BERT-base model [devlin-etal-2019-bert] are adopted as the visual and language backbones, respectively. Specifically, the 5 convolutional blocks of ResNet-50 and an extra convolution layer are used for spatial down-sampling. We initialize the visual/text encoders and the cross-modal transformer with image-text pre-trained weights from ClipBERT [lei2021less], which leverages large-scale image-text datasets [coco, Krishna2016VisualGC].

We use a 4-layer transformer to construct our CDA. In each transformer layer, we set the number of attention heads and sampling points to 4 and 8, respectively. Also, we use a 2-layer transformer for the dependency attention encoder. The first attention head of the first layer corresponds to a dependency-guided self-attention module. 3D and 1D positional embeddings are applied for visual and language embeddings, respectively. We also add different type embeddings to both video and text inputs of the cross-modal transformer to indicate their source type. We report more details such as hyperparameters in supplementary materials.

4.3 Quantitative Results

Comparison with baselines on varied benchmarks

We compare DSR with state-of-the-art models on the aforementioned datasets. As shown in Table 1, DSR consistently outperforms all baselines on the AGQA dataset. Compared to the best baseline, ClipBERT, our model gains 1.33, 3.14, and 2.70 points on the full balanced, novel composition, and more composition datasets, respectively. We ran three independent trials for Tables 1, 2, and 5 and confirmed the statistical significance of DSR via a t-test. Notably, DSR, a model that concentrates on complex spatio-temporal reasoning, records especially high scores on the novel composition and more composition subsets. Since these two subsets are intentionally curated to test the generalizability and reasoning ability of a model, the results attest to the quality of DSR. Experimental results according to the number of compositional steps are given in the supplementary materials.

Table 2 shows the results on the MSRVTT-QA, TGIF-QA, and TVQA datasets. We experiment on three tasks (i.e., Action, Transition, FrameQA) in the TGIF-QA benchmark. Even though MSRVTT-QA and TGIF-QA mostly require an understanding of spatial features rather than temporal reasoning about the given questions, our method achieves scores comparable to ClipBERT. Moreover, DSR achieves the state-of-the-art result in the V+Q setting of TVQA, where subtitles and timestamps are not used for training.

Methods                              MSRVTT-QA  TGIF-QA Action  TGIF-QA Transition  TGIF-QA FrameQA  TVQA
Co-Memory [DBLP:conf/cvpr/GaoGCN18]  32.0       68.2            74.3                51.5             -
PSAC [li2019beyond]                  -          70.4            76.9                55.7             -
HME [fan-CVPR-2019]                  33.0       73.9            77.8                53.8             -
HCRN [le2020hierarchical]            35.6       75.0            81.4                55.9             -
QueST [jiang2020divide]              34.6       79.5            81.0                59.7             -
multi-stream [lei2018tvqa]           -          -               -                   -                43.8
ClipBERT [lei2021less]               37.4       82.4            87.3                58.8             44.4
DSR (Ours)                           37.2       81.7            87.6                58.3             48.8
Table 2: Experimental results on benchmark datasets. Action, Transition, and FrameQA are TGIF-QA tasks.

QA Performance based on Sequence Length of Visual Features

Here, we analyze the efficiency and effectiveness of DSR when addressing long sequences of visual features. We first explore how the QA accuracy varies as we increase the number of frames, so that the visual features cover a longer temporal range. In this experiment, we set the fps to 1 by default. From Table 3, we observe that the QA accuracy increases as we show more frames to the model. However, since the computational cost of self-attention in the transformer-based QA module grows quadratically with the input sequence length, there is a limit to considering longer sequences without any sparsification of the visual features. In contrast, DSR samples a subset of informative visual features from the dense feature map, so the number of visual features is controllable as a hyperparameter. As a result, we achieve higher accuracy even with far fewer visual tokens (57 vs. 392). More detailed analyses of the memory efficiency of DSR according to sequence length are discussed in the supplementary materials.

# of Frames | # of Visual Tokens | Binary | Open | All
2 | 2×7×7 | 60.22 | 46.05 | 50.21
4 | 4×7×7 | 60.32 | 47.40 | 51.20
8 | 8×7×7 | 61.29 | 46.37 | 50.75
32 w/ DSR | 32 + 25 | 64.47 | 48.58 | 53.24
Table 3: Accuracy for various sequence lengths of visual features.

Sparse Sampling vs. Dense Sampling

In this experiment, we compare the effectiveness of randomly sampled sparse features and densely sampled features for the intricate compositional reasoning task. ClipBERT [lei2021less] proposes a sparse sampling-based training strategy to limit computation cost and memory consumption. The Sparse Random rows in Table 4 follow the training convention of ClipBERT: multiple clips are randomly sampled across the whole video, each consisting of 2 consecutive frames at 2 fps. A shared transformer-based QA model then independently predicts answers for each clip, and the answer logits from the clips are averaged for the final decision. In contrast to sparse sampling, dense sampling aims to cover a temporally longer sequence with just one clip to address the intricate spatio-temporal reasoning task of AGQA. We observe that dense sampling with DSR achieves higher accuracy than sparse sampling. Since DSR samples a few diverse, informative visual features from the spatio-temporally dense feature map, the model can effectively associate question words with the sampled visual features, which leads to the highest accuracy in Table 4.


Sampling Method | # Frames per Clip | # Clips | Acc.
Sparse Random | 2 | 1 | 50.57
Sparse Random | 2 | 2 | 52.17
Sparse Random | 2 | 4 | 52.80
Sparse Random | 2 | 16 | 52.93
Dense w/ DSR | 32 | 1 | 53.24
Table 4: Comparison of sparse sampling and dense sampling strategies.
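The sparse-sampling inference convention described above (independent clip scoring followed by logit averaging) can be sketched as follows; `predict_sparse` and the dummy scorer are our illustrative names, not ClipBERT's actual code.

```python
# Sketch of ClipBERT-style sparse-sampling inference, as we read it from
# Table 4: each randomly placed clip is scored independently, and the
# clip-level answer logits are averaged for the final decision.
import random

def predict_sparse(video_frames, num_clips, clip_len, score_clip):
    """score_clip: callable mapping a clip (list of frames) to answer logits."""
    starts = [random.randrange(0, len(video_frames) - clip_len + 1)
              for _ in range(num_clips)]
    clip_logits = [score_clip(video_frames[s:s + clip_len]) for s in starts]
    # Average logits over clips for the final decision.
    num_answers = len(clip_logits[0])
    return [sum(logits[a] for logits in clip_logits) / num_clips
            for a in range(num_answers)]

# Toy usage with a dummy two-answer scorer.
frames = list(range(64))
avg_logits = predict_sparse(frames, num_clips=4, clip_len=2,
                            score_clip=lambda clip: [sum(clip), 1.0])
```

Dense sampling with DSR replaces the clip loop with a single long clip whose tokens are selected by the deformable sampler.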

Ablation study

In this section, we conduct extensive ablation experiments on the hyperparameters of DSR. The first row in Table 5 is our best configuration among all controllable variables. First, we observe an improvement from the dependency attention module: the dependency encoder helps structure the question sequences by enforcing dependency relations. We then explore the best number of visual features to sample. When we increase the number of object queries from 5 to 25, QA accuracy improves consistently. However, with 50 object queries, QA accuracy drops slightly; we attribute this to noisy and redundant visual features being sampled when too many sampling points are considered. Thus, we set the number of object queries to 25 by default for the experiments in the prior sections.

The next ablation concerns the effectiveness of the global context features. From the sixth and seventh rows in Table 5, we observe that the global context features are notably helpful in boosting QA accuracy. While the “only local” model shows lower accuracy than the “only global” model, we achieve the best performance with the combination of global and local features. This indicates that a proper association of global and local features is crucial for addressing the complex spatio-temporal reasoning task.

Dep. | # of Obj. q | Glob. | Reg. | Sampl. | Acc.
✓ | 32 + 25 | both | SO | Deform | 53.24
✗ | 32 + 25 | both | SO | Deform | 52.26
✓ | 32 + 5 | both | SO | Deform | 47.55
✓ | 32 + 10 | both | SO | Deform | 51.29
✓ | 32 + 50 | both | SO | Deform | 52.04
✓ | 32 | only global | SO | Deform | 50.64
✓ | 25 | only local | SO | Deform | 45.72
✓ | 32 + 25 | both | - | Deform | 50.97
✓ | 32 + 25 | both | Cont. | Deform | 49.84
✓ | 32 + 25 | both | MCR | Deform | 51.06
✓ | 57 | local | - | Rand | 51.81
Table 5: Results of the ablation experiments.

Subsequently, we explore the three types of sampling regularization terms. We find that the Soft Orthogonality (Eq. 3) regularization achieves the best performance. The MCR regularization shows high variance in the norm of the gradients, which causes an unstable training process; we conjecture that this variance stems from the operator used in MCR. The contrastive loss shows the lowest accuracy, which could be due to the small batch size imposed by the 12-layer transformer and the ResNet-50 taking video data as input.

Finally, we compare DSR to a random sampling strategy over visual features. For the random sampling strategy, we randomly sample 57 visual features from the dense feature map during training, and uniformly sample the visual features from the flattened dense feature map at inference. As expected, DSR with the diversity regularization and the global-local fusion achieves higher accuracy than random sampling by a large margin under the same number of visual tokens. This indicates that our two strategies, avoiding collapsed sampling and enabling global-local information interaction, are essential in this sampling-based VideoQA task.
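A minimal sketch of this random-sampling baseline, assuming the dense feature map is flattened into a list of token vectors: random indices are drawn at training time and evenly spaced indices at inference time. All names here are ours, not the paper's code.

```python
# Sketch of the random-sampling baseline described above: random indices at
# training time, uniform (evenly spaced) indices at inference time.
import random

def sample_tokens(dense_tokens, k, training):
    """Select k tokens from the flattened dense feature map."""
    n = len(dense_tokens)
    if training:
        idx = random.sample(range(n), k)          # random subset of k tokens
    else:
        # Evenly spaced grid from the first to the last position.
        idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [dense_tokens[i] for i in idx]

dense = list(range(32 * 49))   # e.g., 32 frames x 7x7 patches, flattened
train_subset = sample_tokens(dense, 57, training=True)
eval_subset = sample_tokens(dense, 57, training=False)
```

Unlike DSR's deformable sampler, this selection is unconditional on the question, which is one reason it trails DSR under the same token budget.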

In addition, in the supplementary materials we 1) analyze the diversity and suitability of the sampled tokens by visualizing the output of the deformable sampler, 2) visualize the effectiveness of the dependency attention module, and 3) validate that the performance gain of DSR comes from the novel modules we propose, not from the increased parameter count.

5 Conclusion

This paper presents DSR, a state-of-the-art compositional reasoning model for video question answering that utilizes a deformable sampling module and a dependency attention module for efficient video-text representation learning. Based on our finding that a dense model performs better than a sparse model on the compositional reasoning dataset, which differs from the conclusion of previous work, we conditionally sample question-related visual features from a dense feature map. This process remarkably reduces the number of visual tokens needed by the cross-modal transformer while improving efficiency: both the maximum allowed batch size and the performance increase. The dependency-based attention module helps the model conduct multi-step reasoning by guiding a particular attention head with structured dependency relations. Extensive experiments verify that our model especially stands out on intricate benchmarks, and comprehensive ablation studies demonstrate that each component contributes to the final performance.

Acknowledgements This work was supported by Korea research grant from IITP (2022-0-00264/50%, 2022-0-00951/20%, 2022-0-00612/20%, 2020-0-01373/10%).


Supplementary Material

This material complements our paper with additional experimental results and their analysis. First, we verify the robustness of our model, DSR, with additional quantitative results in Section A. Section B shows the superiority of DSR in terms of memory efficiency. This is followed by a visual analysis of the two modules we proposed, the deformable sampling module and the dependency attention module, in Section C. Afterward, Section D provides qualitative examples for which our model predicts correct answers. Lastly, Section E describes the implementation details, such as the training settings. The code will be made publicly available.

Appendix A Additional Quantitative Results

Table 6 shows the superior performance of our model, DSR, across the compositional reasoning steps of the given questions. The compositional reasoning step of a question refers to how many inference steps are required to find the correct answer; in other words, the more reasoning steps a question has, the more difficult it is. As shown in Table 6, our model consistently performs well compared to the strong ClipBERT baseline, regardless of the number of compositional steps. Although our model is designed to target complex questions, it also works effectively for questions with few compositional steps. In particular, there is a large difference in performance for questions with five compositional steps, which require extensive spatio-temporal reasoning.

In addition, we validate that the performance gain of DSR comes from the novel modules we propose, not from the increased parameter count, by ablating each module with matched parameter sizes. Based on the ClipBERT architecture that records the best score in Table 4 of the main paper (a), we add 2 transformer layers in place of the dependency attention for the text embedding (b). On top of (b), we add 4 transformer layers instead of the conditional sampler for the visual embedding (c). Remarkably, (b; 52.62) and (c; 52.56) record even lower scores than (a; 53.24): simply adding extra parameters does not increase QA accuracy. We also postulated that adding the question embedding to the learnable queries is crucial for the deformable sampler; without conditioning on the question, the sampled visual tokens are identical regardless of which question is given, which is suboptimal for complex QA tasks. We obtain an accuracy of 52.39 without question conditioning, which is lower than our full model. In short, all of the proposed modules add value.

Step | Metric | DSR (Ours) | ClipBERT [lei2021less]
1 | Binary | 75.24 | 74.08
1 | Open | 9.23 | 8.46
1 | All | 74.98 | 73.82
2 | Binary | 75.99 | 75.50
2 | Open | 46.92 | 46.48
2 | All | 55.87 | 55.42
3 | Binary | 79.64 | 79.61
3 | Open | 71.24 | 70.55
3 | All | 74.70 | 73.87
4 | Binary | 82.8 | 83.82
4 | Open | 49.84 | 48.86
4 | All | 54.01 | 53.29
5 | Binary | 58.34 | 48.23
5 | Open | 57.78 | 33.41
5 | All | 50.26 | 38.30
Table 6: Quantitative comparison with ClipBERT on the reasoning-step-based subsets of the AGQA dataset.
Method | 2 frames | 4 | 8 | 16 | 32 | 64 | 128 | 256 | Max Frames
Baseline | 7.41 | 7.61 | 8.30 | 10.52 | 16.08 | OOM | OOM | OOM | 60
ClipBERT [lei2021less] | 7.39 | 7.91 | 8.86 | 10.14 | 12.71 | 17.84 | 28.15 | OOM | 162
DSR (Ours) | 6.69 | 6.81 | 7.21 | 8.31 | 9.68 | 13.64 | 23.65 | 32.46 | 269
Table 7: Comparison of memory consumption in GB for different numbers of input frames. Max Frames in the last column denotes the maximum number of frames right before the OOM error.

Appendix B Memory Efficiency of DSR

For the spatio-temporally complex QA task, it is important for a model to efficiently cover as many frames as possible. Here, we compare how many frames each method can address before an OOM error is raised on a single NVIDIA V100 GPU with 32GB of memory. For this experiment, we set the batch size to 1 and use the same visual backbone and cross-modal transformer architecture for all methods. For ClipBERT, each clip consists of 2 consecutive frames, i.e., 64 clips are needed to address a 128-frame length, which is the default and best configuration in their paper [lei2021less].

From Table 7, we observe that our DSR shows the best memory efficiency among all methods. For the Baseline model, all visual features are fed to the cross-modal transformer without any pooling or sparsification. As a result, the memory requirement increases quadratically with the length of the full visual feature sequence: O((T·HW + L)^2), where T is the number of frames, HW is the spatial size of the feature map, and L is the length of the question words. In contrast, ClipBERT and DSR address long sequences more efficiently than the baseline model. In ClipBERT, the feature map is pooled temporally so that only the spatial sequence length (HW) is considered in the cross-modal transformer. However, ClipBERT is still less efficient than our DSR: since ClipBERT passes multiple short clips independently, the memory requirement of the cross-modal transformer becomes O(C·(HW + L)^2), where C denotes the number of clips. In DSR, the memory consumption is only O((N_q + N_g + L)^2), where N_q and N_g indicate the number of learnable queries and the length of the global context features, respectively.
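Plugging illustrative numbers into the memory terms above makes the gap concrete. The symbols and helper names below follow our reconstruction of the complexity discussion (T frames, HW spatial positions, L question tokens, C clips, N_q queries, N_g global tokens) and are not the paper's code.

```python
# Illustrative comparison of the attention-memory terms discussed above:
#   baseline : (T*HW + L)^2      -- full dense sequence
#   ClipBERT : C * (HW + L)^2    -- temporally pooled clips, C clips
#   DSR      : (Nq + Ng + L)^2   -- sampled + global tokens only

def baseline_cost(T, HW=49, L=20):
    return (T * HW + L) ** 2

def clipbert_cost(T, clip_len=2, HW=49, L=20):
    C = T // clip_len
    return C * (HW + L) ** 2

def dsr_cost(Nq=25, Ng=32, L=20):
    return (Nq + Ng + L) ** 2

for T in (32, 128):
    print(T, baseline_cost(T), clipbert_cost(T), dsr_cost())
```

The DSR term is constant in T, which matches Table 7: its memory grows only through the backbone, not through the cross-modal attention.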

Appendix C Visual Analysis on Usefulness of Each Module

As introduced in Section 3 of the main paper, we proposed two novel modules for spatio-temporal reasoning. To provide the qualitative verification on the effectiveness of each module, this section consists of two parts, 1) qualitative examples of sampled visual features in line with the given question, and 2) visualization of dependency attention weights.

c.1 Justification of conditionally sampled visual features

Figure 6: Different sampling points according to different questions in the same video.
Figure 7: Video frames corresponding to the deformable sampling points.

This section demonstrates the validity of our deformable sampling module, which samples essential visual features conditioned on the given question. Instead of densely sampling redundant visual features, the module samples a few diverse features that are especially helpful for answering the question. For brevity, Figures 6 and 7 only show the sampling points on the temporal axis. In Figure 6, we observe that DSR samples different sets of frames for each question, which indicates that DSR can sample frames in a question-conditional way. Specifically, the model attends to most of the temporal steps to answer the question in the bottom example of Figure 6, while in the top example the model sparsely attends to specific temporal blocks.

Figure 7 depicts the corresponding video frames along with a sample question. To answer the given question, a model should understand the following actions in chronological order: 1) reading a book, 2) taking a blanket, and 3) snuggling under the blanket. Notably, the sampled frames contain all related actions while excluding most of the unnecessary ones.

c.2 Efficacy of dependency attention head

Figure 8: Visualization of self-attention of question tokens that have passed through the dependency attention module.

Figure 8 visualizes the first head of the last layer of our dependency encoder on the test dataset. The outputs of the dependency encoder serve as the text inputs of the cross-modal transformer in our model, while the baseline models only utilize pre-trained BERT embeddings [devlin-etal-2019-bert] as input. Compared to BERT embeddings, which capture general relationships among text tokens, the dependency encoder benefits from simplifying long questions by additionally embedding hierarchical information.

Figure 9: Examples of intricate VideoQA problems of AGQA that our model predicts correct answer.

Appendix D Qualitative Examples of Successful Prediction

In Figure 9, we illustrate examples from the AGQA dataset for which our model successfully predicts the answer. As the examples show, the AGQA dataset consists of complex questions about long video sequences containing various actions. For example, in the third row of Figure 9, the problem can be solved only by recognizing the actions of wiping the glass and walking while also understanding the order of the two actions. A dense but efficient model such as ours is therefore needed to understand the comprehensive semantic structure of the video.

Appendix E Implementation Details

This section provides the detailed architecture of our method, including the overall framework and two main modules we introduced. Afterward, we provide the training details, such as hyper-parameters for each objective function, over the dataset we utilized.

e.1 Model architecture

Transformer based VideoQA model

We use the same architecture as ClipBERT [lei2021less] for the transformer-based VideoQA model. The number of layers and the number of attention heads per layer are both set to 12. The hidden and intermediate dimensions are 768 and 3072, respectively, and we use the GELU [hendrycks2016gaussian] activation function for the transformer layers. For the classification head, we use 2 fully-connected layers.
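The configuration above can be summarized in a small sketch; the dataclass and the explicit GELU formula are our illustration of the stated hyperparameters, not the released implementation.

```python
# Sketch of the QA-model configuration described above (12 layers, 12 heads,
# hidden 768, intermediate 3072, GELU). Names are ours, for illustration only.
import math
from dataclasses import dataclass

@dataclass
class QAModelConfig:
    num_layers: int = 12
    num_heads: int = 12
    hidden_dim: int = 768
    intermediate_dim: int = 3072
    activation: str = "gelu"

def gelu(x):
    """Exact Gaussian Error Linear Unit via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

cfg = QAModelConfig()
```

Note the standard BERT-base shape: the intermediate dimension is 4× the hidden dimension, and 768 / 12 heads gives 64 dimensions per head.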

Conditional Deformable Attention module

For the Conditional Deformable Attention (CDA) module, we use a 4-layer transformer based on the deformable decoder [zhu2020deformable]. In each layer, we sample 8 offset points with 4 different attention heads. The hidden dimension of each transformer layer is 768, and we use ReLU as the activation function.
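As a rough intuition for the CDA module, here is a toy 1-D version of deformable sampling along the temporal axis in the spirit of the deformable decoder: each query carries a reference point and predicted offsets, and features are gathered by linear interpolation. All numbers and names are illustrative, not the module's actual code.

```python
# Toy 1-D sketch of deformable sampling along the temporal axis. A real CDA
# layer predicts the offsets and weights from the (question-conditioned)
# query; here they are fixed constants for illustration.

def interp(features, t):
    """Linearly interpolate a 1-D feature sequence at fractional position t."""
    t = min(max(t, 0.0), len(features) - 1.0)
    lo = int(t)
    hi = min(lo + 1, len(features) - 1)
    w = t - lo
    return (1 - w) * features[lo] + w * features[hi]

def deformable_sample(features, reference, offsets, weights):
    """Weighted sum of features sampled at reference + each predicted offset."""
    return sum(w * interp(features, reference + o)
               for o, w in zip(offsets, weights))

feats = [float(i) for i in range(16)]   # stand-in per-frame features
out = deformable_sample(feats, reference=8.0,
                        offsets=[-2.5, 0.0, 3.0], weights=[0.2, 0.5, 0.3])
```

Because the offsets are continuous, the sampler can attend between frames and can spread or concentrate its points depending on the question.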

Dependency attention module

A 2-layer transformer with 12 attention heads is used as the backbone of the dependency attention module. The dependency-constrained attention is implemented on the first head of the first layer. For this dependency head, attention probabilities are calculated from dependency parsing relations, and learnable value embeddings are multiplied by the corresponding attention probabilities.
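Our reading of the dependency head can be sketched as follows, with a hard 0/1 dependency mask normalized uniformly over each token's parse neighbours; the actual module uses learnable value embeddings, so this simplification is ours.

```python
# Sketch of a dependency-constrained attention head: attention probabilities
# are derived from dependency-parse edges (hard 0/1 mask, uniformly
# normalized) and multiplied with per-token values. Scalar values stand in
# for value embeddings for brevity.

def dependency_attention(values, dep_edges, num_tokens):
    """values: one scalar per token; dep_edges: (head, dependent) pairs."""
    related = {i: {i} for i in range(num_tokens)}   # each token attends itself
    for head, dep in dep_edges:                     # plus its parse neighbours
        related[head].add(dep)
        related[dep].add(head)
    out = []
    for i in range(num_tokens):
        probs = [1.0 / len(related[i]) if j in related[i] else 0.0
                 for j in range(num_tokens)]
        out.append(sum(p * v for p, v in zip(probs, values)))
    return out

# "the red ball": det(ball, the), amod(ball, red) -> token 2 attends to 0,1,2.
vals = [1.0, 2.0, 3.0]
ctx = dependency_attention(vals, dep_edges=[(2, 0), (2, 1)], num_tokens=3)
```

Restricting one head this way injects the parse hierarchy without constraining the remaining, freely learned heads.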

e.2 Hyperparameters and training details

Default Hyperparameters and Optimization

For all experiments by default, we set the learning rate and weight decay of all modules except the CDA to 5e-5 and 1e-3, respectively. We use the AdamW [loshchilov2017decoupled] optimizer with learning rate warmup over the first 10% of training steps, followed by linear decay to 0. Following the optimization strategy of [zhu2020deformable], the base learning rate of the CDA module is 0.0001, decayed at half of the training steps by a factor of 0.1. The learning rates of the linear projections used for predicting object query reference points and sampling offsets are also multiplied by a factor of 0.1. For the video inputs, we resize frames so that the long spatial side is 448 pixels and add zero padding to the remaining regions on the short side. We set the maximum question length to 100 for all experiments except TGIF-QA, where it is 25.
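The warmup-then-linear-decay schedule described above can be sketched as a simple function of the training step; the step counts below are illustrative, not taken from our training runs.

```python
# Sketch of the default schedule: linear warmup over the first 10% of steps,
# then linear decay to 0. Illustrative only.

def lr_at(step, total_steps, base_lr=5e-5, warmup_frac=0.1):
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / max(warmup, 1)       # linear warmup
    return base_lr * (total_steps - step) / max(total_steps - warmup, 1)

total = 1000
mid_warmup = lr_at(50, total)    # halfway through warmup -> base_lr / 2
peak = lr_at(100, total)         # end of warmup -> base_lr
final = lr_at(1000, total)       # last step -> 0
```

The CDA module uses its own step schedule (a 10× drop at the halfway point) rather than this linear decay.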

Training details

For our main AGQA experimental results, we train our DSR for 5 epochs with a learning rate of 2e-4. We use a total of 32 NVIDIA V100 GPUs with a batch size of 8 per GPU. For ablations, we use 4 GPUs with a learning rate of 5e-5. For MSRVTT-QA and TVQA, we train our model for ten epochs and five epochs, respectively, and the remaining training details are the same as the main AGQA experiment. For TGIF-QA, we train ClipBERT and DSR for 60 epochs with a dropout probability of 0.4 for the final classification head. Other hyperparameters such as learning rate and weight decay are the same as the default setting described above.