1 Introduction

*Xiaojie Wang is the corresponding author.
Visual Question Answering [Agrawal2015VQAVQ] is one of the most challenging multi-modal tasks. Given an image and a natural language question, the task is to answer the question correctly by making use of both visual and textual information. Over the last few years, Visual Question Answering (VQA) has attracted rapidly growing attention. Most questions not only focus on the objects in the image but also require reasoning over the relationships between them; visual relationships therefore play a crucial role in VQA. Considering this, [Cadne2019MURELMR, Li2019RelationAwareGA, Hu2019LanguageConditionedGN] adopt graph neural networks to comprehensively capture the inter-object relations contained in images.
Although the above models achieve better performance by exploring various relation features, they also introduce a large amount of irrelevant information, which degrades their final performance. As shown in Figure 1, the first question about the image can be answered by considering spatial relations alone, while the second concerns only semantic relations. Moreover, each of the two questions involves only a few objects. To coordinate different relationships properly and reduce the negative impact of the noise caused by redundant objects, we propose a novel model for VQA, named Question-Driven Graph Fusion Network (QD-GFN).
In the first stage, to fully cover the rich relationship information in the image, inspired by [Li2019RelationAwareGA], we use multiple graph attention networks to capture different types of visual relationships. To accomplish question-guided graph fusion, we design a novel method based on a cross-attention mechanism that measures the degree of correlation between the question and each type of relationship; the relationships between objects are then updated according to these correlation degrees. For object filtering, we propose an object priority coefficient and use it to remove objects of low importance.
In principle, our work provides a new perspective for exploring VQA tasks. The contributions of our work can be summarized as follows:
We propose a novel question-guided graph fusion module to better coordinate different types of relationships.
We propose an object filtering mechanism to reduce the interference caused by irrelevant objects.
Our method achieves superior performance on the VQA 2.0 dataset [Goyal2017MakingTV] and the more challenging VQA-CP v2 dataset [Agrawal2018DontJA].
2 Related Work
2.1 Visual Question Answering
The conventional framework for VQA systems [Agrawal2015VQAVQ] consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. In the past few years, [Yu2017MultimodalFB, Benyounes2017MUTANMT] have explored new multimodal feature fusion strategies to better combine visual and textual information in a high-dimensional space. Meanwhile, to better understand the visual contents of images and the semantics of questions, [Lu2016HierarchicalQC, Yu2017MultimodalFB] adopt coarse co-attention. However, such coarse co-attention neglects the relationship between each image region and each question word. To fully exploit this information, BAN [Kim2018BilinearAN] establishes dense interactions between each image region and each question word. More recently, following the proposal of the Transformer [Vaswani2017AttentionIA], [Gao2019DynamicFW, Gao2019MultiModalityLI] utilize multi-head attention to excavate fine-grained implicit relationships both inter- and intra-modality. To improve model interpretability, [Li2019RelationAwareGA, Cadne2019MURELMR] directly encode explicit relations, such as semantic and spatial relations between objects, into the image representation.
Our work is complementary to the above studies. Building on [Li2019RelationAwareGA], we further propose a question-guided graph fusion module to better aggregate the information contained in different types of graphs, and introduce an object filtering mechanism to reduce the interference caused by irrelevant information.
2.2 Visual Relationship
Early work treats Visual Relationship Detection (VRD) [Divvala2009AnES, Galleguillos2008ObjectCU] as a post-processing step for object detection: the detected objects are re-scored by considering the object relations contained in the image. Since utilizing visual relationships between objects greatly benefits many computer vision and multimodal tasks, [Lu2016VisualRD] attempts to capture a wider variety of relationships between objects. More recently, with the proposal of the Scene Graph Generation (SGG) task, [Zellers2018NeuralMS, Tang2020UnbiasedSG] reduce the interference caused by bias and further improve the quality of relationship extraction.
3 Method

Given a question $q$ and its corresponding image $I$, the goal of VQA is to predict the answer $\hat{a}$ that best matches the ground-truth answer. This task is defined as a classification problem [Agrawal2015VQAVQ]:
$$\hat{a} = \arg\max_{a} p_{\theta}(a \mid I, q)$$
where $\theta$ denotes the parameters of the model. The detailed architecture of our QD-GFN is displayed in Figure 2 and Figure 3. Relation-aware visual features and textual features are obtained from the Image Relation Encoder and the Question Encoder respectively. Then, the Question-guided Graph Fusion Module adaptively aggregates the rich information contained in the image according to the question, and the Object Filtering Module further eliminates the interference caused by irrelevant objects.
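The classification view of VQA can be illustrated with a toy sketch (hedged: the real model scores a fixed answer vocabulary with learned parameters; the function and variable names below are illustrative, not from the paper):

```python
import math

def predict_answer(logits, answers):
    """Pick the answer maximizing p(a | I, q) over the answer vocabulary.
    `logits` stands in for the scores the model assigns to each answer."""
    m = max(logits)                       # subtract max for a stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs[best]
```

For example, `predict_answer([2.0, 0.5, 1.0], ["red", "blue", "two"])` returns "red", the answer with the highest score.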
3.1 Question Encoder and Image Relation Encoder
The Question Encoder is implemented with a Transformer model [Vaswani2017AttentionIA]: each question is padded to a maximum length of 20 and encoded by a randomly initialized Transformer, yielding question features $Q = \{q_1, \dots, q_L\}$. For the Image Relation Encoder, Faster R-CNN [Ren2015FasterRT] is first employed to detect the objects in $I$, where each object $v_i$ is composed of a visual feature $f_i$ and a bounding-box feature $b_i$. Inspired by [Li2019RelationAwareGA], we adopt three question-adaptive graph attention networks [Velickovic2018GraphAN] to encode semantic relations, spatial relations, and implicit relations into different image representations. For each object $v_i$ in a graph, we introduce a multi-head attention mechanism to calculate its correlation coefficients with the other objects, and select the objects with the top-$K$ coefficients as its adjacency set; for each head, the coefficient of each neighbor is computed as follows:
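The top-$K$ adjacency construction can be sketched as follows (a minimal sketch assuming a precomputed matrix of pairwise correlation coefficients; function and variable names are illustrative, not from the paper):

```python
def topk_neighbors(corr, k):
    """For each object i, keep the k other objects with the highest
    correlation coefficients corr[i][j] as its adjacency set."""
    neighbors = []
    for i, row in enumerate(corr):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        candidates.sort(reverse=True)                 # highest score first
        neighbors.append(sorted(j for _, j in candidates[:k]))
    return neighbors
```

In the model this selection is performed per attention head, so each head can induce a different sparse graph over the same objects.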
Implicit Graph: Since no prior information is introduced into the implicit graph, its edges can be directly illustrated as:
Explicit Graph: Since the edges in the semantic graph and the spatial graph carry label information, we modify the multi-head attention mechanism to be sensitive to explicit relation labels:
where the additional term encodes the explicit relation labels. The update process for the relational graphs can be illustrated as follows:
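One simple way to realize this label sensitivity can be sketched as follows (hedged: the paper's exact parameterization is not shown here; adding a label-indexed bias to each attention logit is an assumption, and `label_bias` is a hypothetical stand-in for learned parameters):

```python
import math

def label_aware_attention(query_i, keys, labels, label_bias):
    """Attention weights from object i to its neighbors, where the logit
    for edge (i, j) gets a bias indexed by the explicit relation label."""
    d = len(query_i)
    logits = []
    for key, lab in zip(keys, labels):
        dot = sum(q * k for q, k in zip(query_i, key)) / math.sqrt(d)
        logits.append(dot + label_bias.get(lab, 0.0))
    m = max(logits)                       # stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The implicit graph is the special case with no labels (all biases zero), which reduces to plain scaled dot-product attention.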
3.2 Question-guided Graph Fusion Module
The Question-guided Graph Fusion Module (GFM) aims to adaptively aggregate different types of relations and visual features according to their relevance to the question. Specifically, after obtaining the relation-aware visual features, we adopt visual-guided question attention to inject information from the question into the visual features:
where $L$ denotes the length of the question. As we adopt multi-head attention, the update of the visual features can be depicted as:
To measure the correlation between the question and each relation graph, mean pooling is performed on the question features and on each relation-aware graph's features to obtain global representations, and we then calculate the cosine similarity between them for the $k$-th graph:
where $k$ indexes the relation graphs in the graph set. Finally, these similarities are utilized to guide the fusion process of the above graphs:
and we denote the graph obtained after this aggregation as the fused graph.
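The fusion step can be sketched as follows (hedged: the paper does not spell out how the similarities are normalized before weighting, so the softmax used here is an assumption, and all names are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def fuse_graphs(graph_feats, question_feat):
    """graph_feats: one list of object feature vectors per relation graph.
    Each graph is weighted by the softmax-normalized cosine similarity
    between its pooled representation and the pooled question feature."""
    sims = [cosine(mean_pool(g), question_feat) for g in graph_feats]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(question_feat)
    n_obj = len(graph_feats[0])
    fused = [[sum(w * g[i][d] for w, g in zip(weights, graph_feats))
              for d in range(dim)] for i in range(n_obj)]
    return fused, weights
```

A graph whose pooled features align with the question thus contributes more to every object's fused representation.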
3.3 Object Filtering Module
To further reduce the interference caused by irrelevant information, the Object Filtering Module (OF) is deployed to remove less important objects. We therefore introduce an attention-based object priority coefficient to measure the importance of each object in the fused graph.
The object priority coefficient is proposed to precisely measure the importance of each object. The fused graph is regarded as fully connected, and we calculate the correlation coefficient between every pair of objects based on attention:
For the $i$-th object, its priority coefficient can be calculated as:
Then we select the top-$P$ objects with the largest priority coefficients as the effective object set, and update the relations between the objects within it:
Based on the updated relations, we again calculate the priority coefficients for the objects remaining in the effective set, and perform the final aggregation as follows:
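The filtering step described in this subsection can be sketched as follows (a minimal sketch: the paper does not specify how pairwise attention scores are aggregated into a priority coefficient, so the column-sum used here is an assumption, and all names are illustrative):

```python
def filter_objects(attn, p):
    """attn[j][i]: attention from object j to object i on the fully
    connected fused graph. An object's priority coefficient is taken
    here as the total attention it receives; keep the top-p objects."""
    n = len(attn)
    priority = [sum(attn[j][i] for j in range(n)) for i in range(n)]
    keep = sorted(range(n), key=priority.__getitem__, reverse=True)[:p]
    return sorted(keep), priority
```

Objects outside the returned set are dropped, and the attention (and priority coefficients) are then recomputed on the smaller graph before the final aggregation.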
3.4 Answer Prediction
4 Experiments

We conduct experiments to evaluate the performance of our QD-GFN on the VQA 2.0 [Goyal2017MakingTV] and VQA-CP v2 [Agrawal2018DontJA] datasets. Meanwhile, we perform extensive ablation studies to explore the potential factors that may affect the final performance of the model.
4.1 Implementation Details
Each question is tokenized and padded with 0 to a maximum length of 20. We choose a 3-layer Transformer encoder with a hidden size of 768 and 12 attention heads as the question encoder. For the image relation encoder, we first extract pre-trained object features with bounding boxes from Faster R-CNN and then map them to the same dimension as the textual features. We employ multi-head attention with 12 heads for all graph attention networks and set the number of adjacent points for each node to 15. The hidden dimension of our model is set to 768. Our model is implemented in PyTorch [Paszke2017AutomaticDI]. In experiments, we use the Adamax optimizer for training, with a mini-batch size of 192. For the learning rate, we employ the warm-up strategy [Goyal2017AccurateLM]. Specifically, we begin with a learning rate of 5e-4 and increase it linearly at each epoch until it reaches 2e-3. After 11 epochs, the learning rate is decayed by a factor of 0.2 every 2 epochs, up to 16 epochs. For the Transformer encoder, we fix the learning rate at 1e-4. Every linear mapping is regularized by weight normalization and dropout (p = 0.2, except for the classifier with 0.5).
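The schedule above can be sketched as follows (hedged: the warm-up length of 4 epochs is an assumption from the stated endpoints, and "decreased by 0.2" is read multiplicatively, since subtracting 0.2 from a peak of 2e-3 would go negative):

```python
def learning_rate(epoch):
    """Epochs are 1-indexed. Linear warm-up from 5e-4 to 2e-3 over the
    first 4 epochs (assumed length), then a 0.2x decay every 2 epochs
    after epoch 11, training up to epoch 16."""
    warmup = [5e-4, 1e-3, 1.5e-3, 2e-3]   # assumed linear steps
    if epoch <= len(warmup):
        return warmup[epoch - 1]
    lr = 2e-3
    if epoch > 11:
        lr *= 0.2 ** ((epoch - 10) // 2)  # one decay at 12-13, another at 14-15, ...
    return lr
```

For instance, under these assumptions the rate stays at 2e-3 through epoch 11 and drops to 4e-4 at epoch 12.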
4.2 Experimental Results
As shown in Table 1, our model achieves superior performance over the baseline methods [Anderson2018BottomUpAT, Kim2018BilinearAN, Cadne2019MURELMR, Li2019RelationAwareGA, Gao2019DynamicFW, Gao2019MultiModalityLI] on the VQA 2.0 validation, test-dev, and test-std splits. In detail, our method obtains an overall accuracy of 70.51 on the test-dev split and 70.71 on the test-std split, surpassing DFAF [Gao2019DynamicFW] and MLIN [Gao2019MultiModalityLI]. Since DFAF and MLIN model dense intra-modal and inter-modal interactions simultaneously, which introduces some irrelevant information, these results suggest that eliminating irrelevant information can effectively raise performance. Furthermore, compared with ReGAT [Li2019RelationAwareGA], our QD-GFN gains improvements of 0.43, 0.24, and 0.13 on validation, test-dev, and test-std respectively. Since our model adopts a relation encoder similar to theirs, these experimental results demonstrate the effectiveness of our proposed GFM and OF.
Then, to test the generalization ability of our model, we conduct experiments on the VQA-CP v2 dataset, where the distributions of the train and test splits differ greatly. The results are illustrated in Table 2: compared with the methods previously evaluated on VQA 2.0, QD-GFN surpasses the baselines by an even larger margin.
In Table 3, we conduct several ablation experiments on the VQA 2.0 validation split to verify the effectiveness of GFM and OF. As we can observe from Table 3, without GFM and OF the performance decreases significantly to 66.81. Comparing line 1 with line 2 shows a gain of 0.26 from the Question-guided Graph Fusion Module, while the Object Filtering Module brings an improvement of 0.48. According to line 4, when GFM and OF are both utilized to give the complete model, it reaches the best performance of 67.61. This result shows that when the OF and GFM modules are combined, they promote each other and bring a greater performance gain.
To further explore the potential factors affecting the performance of the model, Figure 4(a) illustrates the impact of the number of adjacent objects in the image relation encoder: setting the number of adjacent objects to 15 is clearly the best choice. Besides, we test the effect of filtering different numbers of objects. As shown in Figure 4(b), the model achieves the best performance when 20 nodes are reserved after filtering. From these two cases, we find that both a lack of information and information redundancy hurt the model. Existing work pays more attention to introducing rich information but neglects the processing of redundant information, which further highlights the value of our work.

To display the effects of GFM and OF more intuitively, we visualize the regions of interest in the image. As shown in Figure 5, to comprehensively verify the effectiveness of our proposed QD-GFN, we choose questions containing multiple relationship types as cases. From column 2, when GFM and OF are both removed, the model is more easily distracted and thus arrives at a wrong answer. Comparing column 2 with column 3, we observe that GFM effectively reduces the model's misplaced attention, and the difference between column 3 and column 4 indicates that the Object Filtering Module further eliminates irrelevant objects with strong interference. Column 4 shows that by combining the two modules, the model filters irrelevant information at different granularities and correctly focuses on question-relevant objects.
5 Conclusion

We propose the Question-Driven Graph Fusion Network (QD-GFN), a novel framework based on an image relation encoder for visual question answering, which consists of a Question Encoder, an Image Relation Encoder, a Question-guided Graph Fusion Module (GFM), and an Object Filtering Module (OF). Through the GFM and OF modules, our model effectively utilizes the information contained in the image and reduces the interference caused by irrelevant information, achieving competitive results on both the VQA 2.0 and VQA-CP v2 datasets. However, the current design of our GFM is still coarse. In future work, we plan to excavate the associated information between text and image more precisely, to better align information across modalities.
Acknowledgements

We first thank the anonymous reviewers for their suggestions and comments, and then thank our colleagues for their suggestions on this work. This work was partially supported by the National Natural Science Foundation of China (NSFC62076032) and a cooperation project with Beijing SanKuai Technology Co., Ltd.