Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

06/16/2020 ∙ by Zihao Zhu, et al. ∙ The University of Adelaide ∙ Microsoft

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image, which is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning about the final answer. How to capture the question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture the evidence from different layers that is most relevant to the given question. Specifically, the intra-modal graph convolution selects evidence from each modality and the cross-modal graph convolution aggregates relevant information across different modalities. By stacking this process multiple times, our model performs iterative reasoning and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments. The code is available at




1 Introduction

Visual question answering (VQA) [3] is an attractive research direction aiming to jointly analyze multimodal content from images and natural language. Equipped with the capacities of grounding, reasoning and translating, a VQA agent is expected to answer a question in natural language based on an image. Recent works [7, 17, 6] have achieved great success on VQA problems that are answerable by referring solely to the visible content of the image. However, such models are incapable of answering questions that require external knowledge beyond what is in the image. Considering the question in Figure 1, the agent not only needs to visually localize ‘the red cylinder’, but also to semantically recognize it as ‘fire hydrant’ and connect it to the knowledge that ‘fire hydrant is used for firefighting’. Therefore, how to collect the question-oriented and information-complementary evidence from visual, semantic and knowledge perspectives is essential to achieve general VQA.

Figure 1: An illustration of our motivation. We represent an image by multi-layer graphs and cross-modal knowledge reasoning is conducted on the graphs to infer the optimal answer.

To advocate research in this direction, [29] introduces the ‘Fact-based’ VQA (FVQA) task for answering questions by joint analysis of the image and the knowledge base of facts. The typical solutions for FVQA build a fact graph with fact triplets filtered by the visual concepts in the image and select one entity in the graph as the answer. Existing works [28, 29] parse the question into keywords and retrieve the supporting entity only by keyword matching. This kind of approach is vulnerable when the question does not exactly mention the visual concepts (e.g. synonyms and homographs) or when the mentioned information is not captured in the fact graph (e.g. the visual attribute ‘red’ in Figure 1 may be falsely omitted). To resolve these problems, [22] introduces visual information into the fact graph and infers the answer by implicit graph reasoning under the guidance of the question. However, they provide the whole visual information equally to each graph node by concatenating the image, question and entity embeddings. In fact, only part of the visual content is relevant to the question and to a certain entity. Moreover, the fact graph here is still homogeneous since each node is represented by a fixed form of image-question-entity embedding, which limits the model’s flexibility in adaptively capturing evidence from different modalities.

In this work, we depict an image as a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to different modalities. The proposed model focuses on Multi-Layer Cross-Modal Knowledge Reasoning and we name it Mucko for short. Specifically, we encode an image by three layers of graphs, where the object appearance and relationships are kept in the visual layer, the high-level abstraction for bridging the gap between visual and factual information is provided in the semantic layer, and the corresponding knowledge of facts is supported in the fact layer. We propose a modality-aware heterogeneous graph convolutional network to adaptively collect complementary evidence in the multi-layer graphs. It is performed in two procedures. First, the Intra-Modal Knowledge Selection procedure collects question-oriented information from each graph layer under the guidance of the question; then, the Cross-Modal Knowledge Reasoning procedure captures complementary evidence across different layers.

The main contributions of this paper are summarized as follows: (1) We comprehensively depict an image by a heterogeneous graph containing multiple layers of information based on visual, semantic and knowledge modalities. We consider these three modalities jointly and achieve significant improvement over state-of-the-art solutions. (2) We propose a modality-aware heterogeneous graph convolutional network to capture question-oriented evidence from different modalities. In particular, we leverage an attention operation in each convolution layer to select the most relevant evidence for the given question, and the convolution operation is responsible for adaptive feature aggregation. (3) We demonstrate good interpretability of our approach and provide a case study with deeper insights. Through visualization of attention weights and gate values, our model automatically tells which modality (visual, semantic or factual) and which entity contribute more to answering the question.

2 Related Work

Visual Question Answering.

The typical solutions for VQA are based on the CNN-RNN architecture [20] and leverage global visual features to represent the image, which may introduce noisy information. Various attention mechanisms [32, 19, 2] have been exploited to highlight visual objects that are relevant to the question. However, they treat objects independently and ignore their informative relationships. [4] demonstrates that the human ability of combinatorial generalization highly depends on mechanisms for reasoning over relationships. Consistent with this proposal, there is an emerging trend to represent the image by a graph structure that depicts objects and relationships in VQA and other vision-language tasks [10, 27, 17]. As an extension, [11] exploits natural language to enrich the graph-based visual representations. However, it captures the semantics in natural language solely by an LSTM, which lacks fine-grained correlations with the visual information. To go one step further, we depict an image by multiple layers of graphs from visual, semantic and factual perspectives to collect fine-grained evidence from different modalities.

Fact-based Visual Question Answering.

Humans can easily combine visual observation with external knowledge for answering questions, which remains challenging for algorithms. [29] introduces a fact-based VQA task, which provides a knowledge base of facts and associates each question with a supporting fact. Recent works based on FVQA generally select one entity from the fact graph as the answer and fall into two categories: query-mapping based methods and learning based methods. [28] reduces the question to one of the available query templates, which limits the types of questions that can be asked. [29] automatically classifies and maps the question to a query, which does not suffer from the above constraint. In both methods, however, visual information is used to extract facts but is not introduced during the reasoning process. [22] applies a GCN on the fact graph, where each node is represented by the fixed form of image-question-entity embedding. However, the visual information is provided as a whole, which may introduce redundant information for prediction. In this paper, we depict an image by multi-layer graphs and perform cross-modal heterogeneous graph reasoning on them to capture the complementary evidence from different layers that is most relevant to the question.

Heterogeneous Graph Neural Networks.

Graph neural networks have gained momentum quickly in the last few years [31]. Compared with homogeneous graphs, heterogeneous graphs are more common in the real world. [26] generalizes the graph convolutional network (GCN) to handle different relationships between entities in a knowledge base, where each edge with a distinct relationship is encoded independently. [30, 9] propose heterogeneous graph attention networks with a dual-level attention mechanism. All of these methods model different types of nodes and edges on a unified graph. In contrast, the heterogeneous graph in this work contains multiple layers of subgraphs and each layer consists of nodes and edges coming from different modalities. For this specific constraint, we propose the intra-modal and cross-modal graph convolutions for reasoning over such multi-modal heterogeneous graphs.

3 Methodology

Figure 2: An overview of our model. The model contains two modules: Multi-modal Heterogeneous Graph Construction aims to depict an image by multiple layers of graphs and Cross-modal Heterogeneous Graph Reasoning supports intra-modal and cross-modal evidence selection.

Given an image $I$ and a question $Q$, the task aims to predict an answer $A$ while leveraging an external knowledge base, which consists of facts in the form of triplets, i.e. $\langle e_1, r, e_2 \rangle$, where $e_1$ is a visual concept in the image, $e_2$ is an attribute or phrase and $r$ represents the relationship between $e_1$ and $e_2$. The key is to choose a correct entity, i.e. either $e_1$ or $e_2$, from the supporting fact as the predicted answer. We first introduce a novel scheme of depicting an image by three layers of graphs, namely the visual graph, the semantic graph and the fact graph, imitating the understanding of various properties of an object and its relationships. Then we perform cross-modal heterogeneous graph reasoning that consists of two parts: Intra-Modal Knowledge Selection aims to choose question-oriented knowledge from each layer of graphs by intra-modal graph convolutions, and Cross-Modal Knowledge Reasoning adaptively selects complementary evidence across the three layers of graphs by cross-modal graph convolutions. By stacking the above two processes multiple times, our model performs iterative reasoning across all the modalities and arrives at the optimal answer by jointly analyzing all the entities. Figure 2 gives a detailed illustration of our model.

3.1 Multi-Modal Graph Construction

Visual Graph Construction.

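Since most of the questions in FVQA are grounded in the visual objects and their relationships, we construct a fully-connected visual graph to represent such evidence at the appearance level. Given an image $I$, we use Faster-RCNN [25] to identify a set of objects $O = \{o_i\}_{i=1}^{K}$ ($K$ = 36), where each object $o_i$ is associated with a visual feature vector $\mathbf{v}_i \in \mathbb{R}^{d_v}$ ($d_v$ = 2048), a spatial feature vector $\mathbf{b}_i \in \mathbb{R}^{d_b}$ ($d_b$ = 4) and a corresponding label. Specifically, $\mathbf{b}_i = [x_i, y_i, w_i, h_i]$, where $(x_i, y_i)$, $h_i$ and $w_i$ respectively denote the coordinate of the top-left corner, the height and the width of the bounding box. We construct a visual graph $G^V = (V^V, E^V)$ over $O$, where $V^V = \{v^V_i\}_{i=1}^{K}$ is the node set and each node $v^V_i$ corresponds to a detected object $o_i$. The feature of node $v^V_i$ is represented by $\mathbf{v}^V_i$. Each edge $e^V_{ij} \in E^V$ denotes the relative spatial relationship between two objects. We encode the edge feature by a 5-dimensional vector, i.e. $\mathbf{r}^V_{ij} = \left[\frac{x_j - x_i}{w_i}, \frac{y_j - y_i}{h_i}, \frac{w_j}{w_i}, \frac{h_j}{h_i}, \frac{w_j h_j}{w_i h_i}\right]$.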
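As a rough illustration of the edge encoding on a fully-connected visual graph, the sketch below builds one 5-dimensional feature per ordered pair of boxes. It assumes boxes are given as `(x, y, w, h)` tuples and that offsets and sizes are normalised by the first box; the function names and the choice to skip self-loops are illustrative assumptions, not details confirmed by the text.

```python
def relative_box_feature(box_i, box_j):
    """Assumed 5-d relative spatial feature of box j with respect to box i.

    Each box is (x, y, w, h): top-left corner, width, height.
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return [
        (xj - xi) / wi,         # horizontal offset in units of box i's width
        (yj - yi) / hi,         # vertical offset in units of box i's height
        wj / wi,                # width ratio
        hj / hi,                # height ratio
        (wj * hj) / (wi * hi),  # area ratio
    ]


def build_visual_edges(boxes):
    """One edge feature per ordered pair (i, j), i != j (self-loops skipped by assumption)."""
    return {
        (i, j): relative_box_feature(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(len(boxes))
        if i != j
    }


boxes = [(0, 0, 10, 20), (5, 10, 20, 10)]
edges = build_visual_edges(boxes)
```

With the 36 detected objects mentioned above, this would yield 36 × 35 directed edges.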

Semantic Graph Construction.

In addition to visual information, high-level abstraction of the objects and relationships by natural language provides essential semantic information. Such abstraction is indispensable for associating the visual objects in the image with the concepts mentioned in both questions and facts. In our work, we leverage dense captions [12] to extract a set of local-level semantics in an image, ranging from the properties of a single object (color, shape, emotion, etc.) to the relationships between objects (action, spatial positions, comparison, etc.). We depict an image by $D$ dense captions, denoted as $Z = \{z_i\}_{i=1}^{D}$, where $z_i$ is a natural language description of a local region in the image. Instead of using monolithic embeddings to represent the captions, we model them by a graph-based semantic representation, denoted as $G^S = (V^S, E^S)$, which is constructed by a semantic graph parsing model [1]. The node $v^S_i \in V^S$ represents the name or attribute of an object extracted from the captions, while the edge $e^S_{ij} \in E^S$ represents the relationship between $v^S_i$ and $v^S_j$. We use the averaged GloVe embeddings [24] to represent $v^S_i$ and $e^S_{ij}$, denoted as $\mathbf{v}^S_i$ and $\mathbf{r}^S_{ij}$, respectively. The graph representation retains the relational information among concepts and unifies the representations in the graph domain, which is better for explicit reasoning across modalities.

Fact Graph Construction.

To find the optimal supporting fact, we first retrieve relevant candidate facts from the knowledge base of facts following a score-based approach proposed in [22]. We compute the cosine similarity of the embeddings of every word in the fact with the words in the question and the words of the visual concepts detected in the image. Then we average these values to assign a similarity score to the fact. The facts are sorted by similarity and the 100 highest-scoring facts are retained, denoted as $f_{100}$. A relation type classifier is trained additionally to further filter the retrieved facts. Specifically, we feed the last hidden state of an LSTM to an MLP layer to predict the relation type $\hat{r}_i$ of a question. We retain the facts among $f_{100}$ only if their relationships agree with $\hat{r}_i$, i.e. $f_{rel} = \{f \in f_{100} : r(f) \in \{\hat{r}_i\}\}$ ($\{\hat{r}_i\}$ contains the top-3 predicted relationships in experiments). Then a fact graph $G^F = (V^F, E^F)$ is built upon $f_{rel}$, as the candidate facts can be naturally organized as a graphical structure. Each node $v^F_i \in V^F$ denotes an entity in $f_{rel}$ and is represented by the GloVe embedding of the entity, denoted as $\mathbf{v}^F_i$. Each edge $e^F_{ij} \in E^F$ denotes the relationship between $v^F_i$ and $v^F_j$ and is represented by the GloVe embedding $\mathbf{r}^F_{ij}$. The topological structure among facts can be effectively exploited by jointly considering all the entities in the fact graph.
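The retrieval and filtering steps can be sketched as follows. This is a minimal illustration with toy two-dimensional embeddings, not the pipeline of [22]: the helper names, the use of single-token entities, and the choice to average over all (fact word, context word) pairs are assumptions made for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fact_score(fact_words, context_words, emb):
    """Average cosine similarity over all (fact word, context word) pairs."""
    sims = [cosine(emb[f], emb[c])
            for f in fact_words if f in emb
            for c in context_words if c in emb]
    return sum(sims) / len(sims) if sims else 0.0


def retrieve_facts(facts, context_words, emb, predicted_relations, top_n=100):
    """Keep the top_n highest-scoring (e1, r, e2) facts, then filter by relation type."""
    ranked = sorted(facts,
                    key=lambda f: fact_score([f[0], f[2]], context_words, emb),
                    reverse=True)[:top_n]
    return [f for f in ranked if f[1] in predicted_relations]


# Toy embeddings standing in for GloVe vectors (hypothetical values).
emb = {"dog": [1.0, 0.0], "animal": [0.9, 0.1], "car": [0.0, 1.0],
       "vehicle": [0.1, 0.9], "puppy": [1.0, 0.1]}
facts = [("dog", "IsA", "animal"), ("car", "IsA", "vehicle"), ("dog", "UsedFor", "car")]
context = ["puppy"]  # words from the question and the detected visual concepts
kept = retrieve_facts(facts, context, emb, {"IsA"}, top_n=2)
```

Here `context` plays the role of the question words plus visual concept words, and `predicted_relations` the role of the top-3 relation types from the classifier.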

3.2 Intra-Modal Knowledge Selection

Since each layer of graphs contains modality-specific knowledge relevant to the question, we first select valuable evidence independently from the visual graph, semantic graph and fact graph by Visual-to-Visual Convolution, Semantic-to-Semantic Convolution and Fact-to-Fact Convolution respectively. These three convolutions share common operations but differ in the node and edge representations of their graph layers. Thus we omit the superscript of the node representation $\mathbf{v}$ and the edge representation $\mathbf{r}$ in the rest of this section. We first perform attention operations to highlight the nodes and edges that are most relevant to the question and then update the node representations via intra-modal graph convolution. This process consists of the following three steps:

Question-guided Node Attention.

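We first evaluate the relevance of each node to the question with an attention mechanism. The attention weight $\alpha_i$ for node $v_i$ is computed as:

$$\alpha_i = \mathrm{softmax}\left(\mathbf{w}_a^{T} \tanh(\mathbf{W}_1 \mathbf{v}_i + \mathbf{W}_2 \mathbf{q})\right) \tag{1}$$

where $\mathbf{W}_1$, $\mathbf{W}_2$ and $\mathbf{w}_a$ (as well as $\mathbf{W}_3, \dots, \mathbf{W}_{11}$, $\mathbf{w}_b$ and $\mathbf{w}_c$ mentioned below) are learned parameters, and $\mathbf{q}$ is the question embedding encoded by an LSTM.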
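A question-guided node attention of this kind can be sketched in plain Python. The additive `tanh` scoring form, the parameter names `W1`, `W2`, `w_a`, and the toy dimensions are assumptions for illustration; in practice the parameters would be learned.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def node_attention(nodes, q, W1, W2, w_a):
    """Assumed form: alpha = softmax over nodes of w_a . tanh(W1 v_i + W2 q)."""
    q_proj = matvec(W2, q)
    scores = []
    for v in nodes:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, v), q_proj)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    return softmax(scores)


identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
q = [1.0, 0.0]
alpha = node_attention(nodes, q, identity, identity, [1.0, 0.0])
```

In this toy setting the first node, which is aligned with the question vector, receives the largest weight.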

Question-guided Edge Attention.

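Under the guidance of the question, we then evaluate the importance of edge $e_{ji}$, constrained by the neighbor node $v_j$, with respect to $v_i$ as:

$$\beta_{ji} = \mathrm{softmax}\left(\mathbf{w}_b^{T} \tanh(\mathbf{W}_3 \mathbf{v}'_j + \mathbf{W}_4 \mathbf{q}')\right) \tag{2}$$

where $\mathbf{v}'_j = \mathbf{W}_5[\mathbf{v}_j, \mathbf{r}_{ji}]$, $\mathbf{q}' = \mathbf{W}_6[\mathbf{v}_i, \mathbf{q}]$ and $[\cdot, \cdot]$ denotes the concatenation operation.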
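The edge attention can be sketched in the same style. The sketch assumes that each neighbor is first fused with its edge feature by concatenation and a linear map, that the question is fused with the center node in the same way, and that the two are scored additively; the names `W3` to `W6` and `w_b` and all dimensions are illustrative.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def edge_attention(v_i, neighbours, q, W3, W4, W5, W6, w_b):
    """neighbours: list of (v_j, r_ji). Returns (beta over neighbours, fused neighbours)."""
    q_prime = matvec(W6, v_i + q)                      # question fused with center node
    q_proj = matvec(W4, q_prime)
    v_primes = [matvec(W5, v_j + r_ji) for v_j, r_ji in neighbours]  # neighbour fused with edge
    scores = []
    for vp in v_primes:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W3, vp), q_proj)]
        scores.append(sum(w * h for w, h in zip(w_b, hidden)))
    return softmax(scores), v_primes


I2 = [[1.0, 0.0], [0.0, 1.0]]
W5 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]            # maps [v_j, r_ji] (3-d) to 2-d
W6 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # maps [v_i, q] (4-d) to 2-d
v_i, q = [0.5, 0.5], [1.0, -1.0]
same = [([1.0, 0.0], [0.0]), ([1.0, 0.0], [0.0])]
beta_same, v_primes = edge_attention(v_i, same, q, I2, I2, W5, W6, [1.0, 1.0])
differ = [([1.0, 0.0], [0.0]), ([1.0, 0.0], [2.0])]
beta_differ, _ = edge_attention(v_i, differ, q, I2, I2, W5, W6, [1.0, 1.0])
```

Two identical neighbors split the weight evenly, while changing only the edge feature of one neighbor shifts the attention, which is the point of conditioning on the edge.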

Intra-Modal Graph Convolution.

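Given the node and edge attention weights learned in Eq. 1 and Eq. 2, the node representations of each layer of graphs are updated following the message-passing framework [8]. We gather the neighborhood information and update the representation of $v_i$ as:

$$\mathbf{m}_i = \sum_{j \in \mathcal{N}_i} \beta_{ji} \mathbf{v}'_j \tag{3}$$

$$\hat{\mathbf{v}}_i = \mathrm{ReLU}\left(\mathbf{W}_7[\mathbf{m}_i, \alpha_i \mathbf{v}_i]\right) \tag{4}$$

where $\mathcal{N}_i$ is the neighborhood set of node $v_i$.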
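The update step can be illustrated with a small worked example. The sketch assumes the message is the edge-attention-weighted sum of the fused neighbor features, and that it is concatenated with the node-attention-scaled center feature before a linear map and a ReLU; `W7` and the numbers are hypothetical.

```python
def intra_modal_update(v_i, alpha_i, betas, v_primes, W7):
    """Aggregate neighbours with edge attention, then fuse with the attended node itself."""
    dim = len(v_primes[0])
    m_i = [sum(b * vp[k] for b, vp in zip(betas, v_primes)) for k in range(dim)]
    fused = m_i + [alpha_i * x for x in v_i]          # concatenation [m_i, alpha_i * v_i]
    return [max(0.0, sum(w * x for w, x in zip(row, fused))) for row in W7]


v_i = [1.0, -2.0]
alpha_i = 0.5
betas = [0.25, 0.75]
v_primes = [[4.0, 0.0], [0.0, 4.0]]
W7 = [[1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, 4.0]]
updated = intra_modal_update(v_i, alpha_i, betas, v_primes, W7)
```

Here the message is `[1.0, 3.0]` and the scaled center feature is `[0.5, -1.0]`, so the second output unit is negative before the ReLU and is clipped to zero.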

We conduct the above intra-modal knowledge selection on $G^V$, $G^S$ and $G^F$ independently and obtain the updated node representations, denoted as $\{\hat{\mathbf{v}}^V_i\}$, $\{\hat{\mathbf{v}}^S_i\}$ and $\{\hat{\mathbf{v}}^F_i\}$ accordingly.

3.3 Cross-Modal Knowledge Reasoning

To answer the question correctly, we fully consider the complementary evidence from visual, semantic and factual information. Since the answer comes from one entity in the fact graph, we gather complementary information from visual graph and semantic graph to fact graph by cross-modal convolutions, including visual-to-fact convolution and semantic-to-fact convolution. Finally, a fact-to-fact aggregation is performed on the fact graph to reason over all the entities and form a global decision.

Visual-to-Fact Convolution.

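For the entity $v^F_i$ in the fact graph, the attention value of each node $v^V_j$ in the visual graph w.r.t. $v^F_i$ is calculated under the guidance of the question:

$$\gamma^{V\text{-}F}_{ji} = \mathrm{softmax}\left(\mathbf{w}_c \tanh(\mathbf{W}_8 \hat{\mathbf{v}}^V_j + \mathbf{W}_9[\hat{\mathbf{v}}^F_i, \mathbf{q}])\right) \tag{5}$$

The complementary information $\mathbf{m}^{V\text{-}F}_i$ from the visual graph for $v^F_i$ is computed as:

$$\mathbf{m}^{V\text{-}F}_i = \sum_{v^V_j \in V^V} \gamma^{V\text{-}F}_{ji} \hat{\mathbf{v}}^V_j \tag{6}$$
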
Semantic-to-Fact Convolution.

The complementary information $\mathbf{m}^{S\text{-}F}_i$ from the semantic graph is computed in the same way as in Eq. 5 and Eq. 6.
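Both cross-modal convolutions can share one routine, since only the source graph changes. The sketch below assumes an additive attention between each source node and the concatenation of the fact entity with the question, followed by a weighted sum of the source nodes; the names `W8`, `W9`, `w_c` and the toy sizes are assumptions.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_message(fact_node, source_nodes, q, W8, W9, w_c):
    """Attend over source-graph nodes for one fact entity; return (message, weights)."""
    context = matvec(W9, fact_node + q)               # fact entity fused with question
    scores = []
    for s in source_nodes:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W8, s), context)]
        scores.append(sum(w * h for w, h in zip(w_c, hidden)))
    gamma = softmax(scores)
    dim = len(source_nodes[0])
    message = [sum(g * s[k] for g, s in zip(gamma, source_nodes)) for k in range(dim)]
    return message, gamma


I2 = [[1.0, 0.0], [0.0, 1.0]]
visual_nodes = [[2.0, 0.0], [2.0, 0.0]]
message, gamma = cross_modal_message([1.0], visual_nodes, [0.0], I2, I2, [1.0, 1.0])
```

Calling the same function with the semantic nodes as `source_nodes` would give the semantic-to-fact message.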

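Then we fuse the complementary knowledge for $v^F_i$ from the three layers of graphs via a gate operation:

$$\mathrm{gate}_i = \sigma\left(\mathbf{W}_{10}[\mathbf{m}^{V\text{-}F}_i, \mathbf{m}^{S\text{-}F}_i, \hat{\mathbf{v}}^F_i]\right) \tag{7}$$

$$\tilde{\mathbf{v}}^F_i = \mathbf{W}_{11}\left(\mathrm{gate}_i \circ [\mathbf{m}^{V\text{-}F}_i, \mathbf{m}^{S\text{-}F}_i, \hat{\mathbf{v}}^F_i]\right) \tag{8}$$

where $\sigma$ is the sigmoid function and “$\circ$” denotes the element-wise product.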
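A gate of this kind can be sketched as below. The sketch assumes the gate is a sigmoid over a linear map of the concatenated visual message, semantic message and fact entity, applied element-wise to that same concatenation before a final projection; `W10`, `W11` and the one-dimensional inputs are illustrative.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(m_vf, m_sf, v_f, W10, W11):
    """Return (fused entity representation, gate values)."""
    concat = m_vf + m_sf + v_f
    gate = [sigmoid(z) for z in matvec(W10, concat)]
    gated = [g * c for g, c in zip(gate, concat)]     # element-wise product
    return matvec(W11, gated), gate


W10 = [[0.0, 0.0, 0.0] for _ in range(3)]   # zero weights: every gate value is 0.5
W11 = [[1.0, 1.0, 1.0]]
fused, gate = gated_fusion([2.0], [4.0], [6.0], W10, W11)
```

Because the gate has one value per dimension of the concatenation, its values can be read off per modality, which is what the gate visualization in Figure 3 relies on.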

Fact-to-Fact Aggregation.

Given a set of candidate entities in the fact graph $G^F$, we aim to globally compare all the entities and select an optimal one as the answer. At this point, the representation of each entity in the fact graph gathers question-oriented information from three modalities. To jointly evaluate the possibility of each entity, we apply an attention-based graph convolutional network, similar to the Fact-to-Fact Convolution introduced in Section 3.2, to aggregate information in the fact graph and obtain the transformed entity representations.

We iteratively perform intra-modal knowledge selection and cross-modal knowledge reasoning in multiple steps to obtain the final entity representations. After $T$ steps, each entity representation captures the structural information within its $T$-hop neighborhood across the three layers.

3.4 Learning

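The concatenation of the entity representation $\tilde{\mathbf{v}}^F_i$ and the question embedding $\mathbf{q}$ is passed to a binary classifier to predict its probability of being the answer, i.e. $\hat{y}_i = p_\theta([\tilde{\mathbf{v}}^F_i, \mathbf{q}])$. We apply the binary cross-entropy loss in the training process:

$$l_n = -\sum_{v^F_i \in V^F} \left[a \cdot y_i \ln \hat{y}_i + b \cdot (1 - y_i) \ln(1 - \hat{y}_i)\right] \tag{9}$$

where $y_i$ is the ground truth label for $v^F_i$, and $a$, $b$ represent the loss function weights for positive and negative samples respectively. The entity with the largest probability is selected as the final answer.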
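A class-weighted binary cross-entropy and the final answer selection can be sketched as follows. Summing (rather than averaging) over entities, the clipping constant `eps`, and the example weights are assumptions made for the illustration.

```python
import math


def weighted_bce(probs, labels, a, b, eps=1e-12):
    """Sum of per-entity losses; a weights positives, b weights negatives."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += a * y * math.log(max(p, eps)) + b * (1 - y) * math.log(max(1 - p, eps))
    return -total


def select_answer(entities, probs):
    """Pick the entity with the largest predicted probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return entities[best]


loss = weighted_bce([0.5, 0.5], [1, 0], a=2.0, b=1.0)
answer = select_answer(["party", "wedding", "cake"], [0.6, 0.3, 0.1])
```

Since each question has only one correct entity among many candidates, weighting positives more heavily than negatives is a common way to counter the imbalance.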

4 Experiments

Method Overall Accuracy
top-1 top-3
FVQA (top-3-QQmaping)
FVQA (Ensemble) -
Straight to the Facts (STTF)
Reading Comprehension
Out of the Box (OB)
Human -
Table 1: State-of-the-art comparison on FVQA dataset.
Method Overall Accuracy
top-1 top-3
Mucko (full model)
1 w/o Semantic Graph
2 w/o Visual Graph
3 w/o Semantic Graph & Visual Graph
4 S-to-F Concat.
5 V-to-F Concat.
6 V-to-F Concat. & S-to-F Concat.
7 w/o relationships
Table 2: Ablation study of key components of Mucko.


We evaluate Mucko on the FVQA dataset [29]. It consists of 2,190 images, 5,286 questions and a knowledge base of 193,449 facts. Facts are constructed by extracting top visual concepts in the dataset and querying these concepts in WebChild, ConceptNet and DBpedia. (We provide more experimental results on OK-VQA and Visual7W+KB in the supplementary materials.)

Evaluation Metrics.

We follow the metrics in [29] to evaluate the performance. The top-1 and top-3 accuracy is calculated for each method. The averaged accuracy of 5 test splits is reported as the overall accuracy.
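The metric itself is simple to state in code. The sketch below assumes each prediction is a list of candidate answers ranked by probability and that the overall accuracy is the plain mean over test splits; the function names and toy data are illustrative.

```python
def top_k_accuracy(ranked_predictions, answers, k):
    """Fraction of questions whose ground-truth answer is among the top-k predictions."""
    hits = sum(1 for preds, ans in zip(ranked_predictions, answers) if ans in preds[:k])
    return hits / len(answers)


def overall_accuracy(per_split_accuracies):
    """Average accuracy over the test splits."""
    return sum(per_split_accuracies) / len(per_split_accuracies)


preds = [["a", "b", "c"], ["x", "y", "z"]]
answers = ["b", "x"]
top1 = top_k_accuracy(preds, answers, k=1)
top3 = top_k_accuracy(preds, answers, k=3)
```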

Implementation Details.

We select the top-10 dense captions according to their confidence. The max sentence length of the dense captions and the questions is set to 20. The hidden state size of all the LSTM blocks is set to 512. We set and in the binary cross-entropy loss. Our model is trained with the Adam optimizer for 20 epochs, where the mini-batch size is 64 and the dropout ratio is 0.5. A warm-up strategy is applied for 2 epochs with initial learning rate and warm-up factor . Then we use a cosine annealing learning strategy with initial learning rate and termination learning rate for the remaining epochs.
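A warm-up followed by cosine annealing can be sketched per epoch as below. Only the 20 total epochs and 2 warm-up epochs come from the text; the linear shape of the warm-up ramp and every learning-rate value passed in the example are placeholders, not the settings used in the experiments.

```python
import math


def learning_rate(epoch, base_lr, min_lr, warmup_factor, total_epochs=20, warmup_epochs=2):
    """Linear warm-up from warmup_factor * base_lr, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return base_lr * (warmup_factor + (1.0 - warmup_factor) * frac)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values chosen only to make the shape of the schedule visible.
schedule = [learning_rate(e, base_lr=1.0, min_lr=0.1, warmup_factor=0.2) for e in range(21)]
```

The schedule starts at a fraction of the base rate, reaches the base rate when warm-up ends, and decays smoothly to the termination rate at the last epoch.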

Figure 3: Visualization for Mucko. Visual graph highlights the most relevant subject (red box) according to the attention weights of each object ($\alpha_i$ in Eq. 1) and the objects (orange boxes) with the top-2 attended relationships ($\beta_{ji}$ in Eq. 2). Fact graph shows the predicted entity (center node) and its top-4 attended neighbors ($\alpha_i$ in Eq. 1). Semantic graph shows the most relevant concept (center node) and up to its top-4 attended neighbors ($\alpha_i$ in Eq. 1). Each edge is marked with its attention value ($\beta_{ji}$ in Eq. 2). Dashed lines represent the visual-to-fact convolution weights (orange) and semantic-to-fact convolution weights (blue) of the predicted entity ($\gamma_{ji}$ in Eq. 5). The heat map on the top visualizes the gate values ($\mathrm{gate}_i$ in Eq. 7) of the visual embedding (left), entity embedding (middle) and semantic embedding (right).

4.1 Comparison with State-of-the-Art Methods

Table 1 shows the comparison of Mucko with state-of-the-art models, including CNN-RNN based approaches [29], i.e. LSTM-Question+Image+Pre-VQA and Hie-Question+Image+Pre-VQA, semantic parsing based approaches [29], i.e. FVQA (top-3-QQmaping) and FVQA (Ensemble), learning-based approaches, i.e. Straight to the Facts (STTF) [23] and Out of the Box (OB) [22], and a Reading Comprehension based approach [16]. Our model consistently outperforms all the approaches on all the metrics and achieves a 3.71% boost on top-1 accuracy and a 5.69% boost on top-3 accuracy compared with the state-of-the-art model. The model OB is most relevant to Mucko in that it leverages graph convolutional networks to jointly assess all the entities in the fact graph. However, it introduces the global image features equally to all the entities without selection. By collecting question-oriented visual and semantic information via modality-aware heterogeneous graph convolutional networks, our model gains a remarkable improvement.

4.2 Ablation Study

In Table 2, we show ablation results to verify the contribution of each component in our model. (1) In models ‘1-3’, we evaluate the influence of each layer of graphs on the performance. We observe that the top-1 accuracy of ‘1’ and ‘2’ decreases by 1.1% and 3.94% respectively compared with the full model, which indicates that both the semantic and visual graphs provide valuable evidence for answer inference. Of the two, the visual information has a greater impact than the semantic part. When removing both semantic and visual graphs, ‘3’ results in a significant decrease. (2) In models ‘4-6’, we assess the effectiveness of the proposed cross-modal graph convolutions. ‘4’, ‘5’ and ‘6’ respectively replace the ‘Semantic-to-Fact Conv.’ in ‘2’, the ‘Visual-to-Fact Conv.’ in ‘1’ and both in the full model by concatenation, i.e. concatenating the mean pooling of all the semantic/visual node features with each entity feature. The performance decreases when replacing the convolution for either S-to-F or V-to-F, or both simultaneously, which shows the benefit of cross-modal convolution in gathering complementary evidence from different modalities. (3) We evaluate the influence of the relationships in the heterogeneous graph. We omit the relational features in all three layers in ‘7’ and the performance decreases by nearly 1% on top-1 accuracy. This shows the benefit of relational information, though it is less influential than the modality information.

4.3 Interpretability

Our model is interpretable by visualizing the attention weights and gate values in the reasoning process. From the case study in Figure 3, we draw the following three insights: (1) Mucko is capable of revealing the knowledge selection mode. The first two examples indicate that Mucko captures the most relevant visual, semantic and factual evidence as well as complementary information across the three modalities. In most cases, according to the gate values, factual knowledge provides the predominant clues compared with the other modalities, because FVQA relies on external knowledge to a great extent. Furthermore, more evidence comes from the semantic modality when the question involves complex relationships. For instance, the second question, involving the relationship between ‘hand’ and ‘white round thing’, needs more semantic clues. (2) Mucko has advantages over the state-of-the-art model. The third example compares the predicted answer of OB with that of Mucko. Mucko collects relevant visual and semantic evidence to make each entity discriminative enough for predicting the correct answer, while OB fails to distinguish the representations of ‘laptop’ and ‘keyboard’ without feature selection. (3) Mucko fails when multiple answers are reasonable for the same question. Since both ‘wedding’ and ‘party’ may have cakes, the predicted answer ‘party’ in the last example is reasonable from a human perspective.

#Retrieved facts          @50    @100   @150   @200
Rel@1 (top-1 accuracy)    55.56  70.62  65.94  59.77
Rel@1 (top-3 accuracy)    64.09  81.95  73.41  66.32
Rel@3 (top-1 accuracy)    58.93  73.06  70.12  65.93
Rel@3 (top-3 accuracy)    68.50  85.94  81.43  74.87
Table 3: Overall accuracy with different numbers of retrieved candidate facts and different numbers of relation types.
#Steps 1 2 3
top-1 accuracy 62.05 73.06 70.43
top-3 accuracy 71.87 85.94 81.32
Table 4: Overall accuracy with different number of reasoning steps.

4.4 Parameter Analysis

In Table 3, we vary the number of retrieved candidate facts and the number of relation types used for candidate filtering. We achieve the highest downstream accuracy with the top-100 candidate facts and the top-3 relation types. In Table 4, we evaluate the influence of the number of reasoning steps $T$. We find that two reasoning steps achieve the best performance. We use the above settings in our full model.

5 Conclusion

In this paper, we propose Mucko for visual question answering requiring external knowledge, which focuses on multi-layer cross-modal knowledge reasoning. We depict an image by a heterogeneous graph with multiple layers of information corresponding to visual, semantic and factual modalities. We propose a modality-aware heterogeneous graph convolutional network to select and gather intra-modal and cross-modal evidence iteratively. Our model outperforms the state-of-the-art approaches by a clear margin and obtains interpretable results on the benchmark dataset.


This work is supported by the National Key Research and Development Program (Grant No.2017YFB0803301).

6 Supplementary Materials

We also conduct extensive experiments on two other large-scale knowledge-based VQA datasets, OK-VQA [21] and Visual7W+KB [15], to evaluate the performance of our proposed model. In this section, we first briefly review the datasets and then report the performance of our proposed method compared with several baseline models.

6.1 Datasets


The Visual7W dataset [33] is built on a subset of images from Visual Genome [14] and includes questions of seven types (what, where, when, who, why, which and how) along with the corresponding answers in a multi-choice format. However, most questions in Visual7W are based solely on the image content and do not require external knowledge. [15] therefore generated a collection of knowledge-based questions for the test images in Visual7W by filling a set of question-answer templates that require reasoning on both visual content and external knowledge. We denote this dataset as Visual7W+KB in our paper. In total, Visual7W+KB consists of 16,850 open-domain question-answer pairs based on 8,425 images in the Visual7W test split. Different from FVQA, Visual7W+KB uses ConceptNet to guide the question generation but does not provide a task-specific knowledge base. In our work, we also leverage ConceptNet to retrieve the supporting knowledge and select one entity as the predicted answer.


[21] proposed the Outside Knowledge VQA (OK-VQA) dataset, which is the largest knowledge-based VQA dataset at present. Different from existing datasets, the questions in OK-VQA are manually written by MTurk workers and are not derived from specific knowledge bases. Therefore, it requires the model to retrieve supporting knowledge from open-domain resources, which is much closer to general VQA but more challenging for existing models. OK-VQA contains 14,031 images randomly collected from the MSCOCO dataset [18], using the original 80k-40k training and validation splits as train and test splits. OK-VQA contains 14,055 questions covering a variety of knowledge categories such as science & technology, history, and sports.

6.2 Experimental results on Visual7W+KB

The comparison of state-of-the-art models on the Visual7W+KB dataset is shown in Table 5. The compared baselines fall into two groups, i.e. memory-based approaches and a graph-based approach. The memory-based approaches [15] include KDMN-NoKnowledge (w/o external knowledge), KDMN-NoMemory (attention-based knowledge incorporation), KDMN (dynamic memory network based knowledge incorporation) and KDMN-Ensemble (an ensemble of several KDMN models). We also test the performance of Out of the Box (OB) [22] on Visual7W+KB and report the results in Table 5.

Consistent with the results on FVQA, we achieve a significant improvement (7.98% on top-1 accuracy and 13.52% on top-3 accuracy) over state-of-the-art models. Note that our proposed method is a single model, yet it outperforms the existing ensemble model [15].

Method Overall Accuracy
top-1 top-3
KDMN-NoKnowledge [15] -
KDMN-NoMemory [15] -
KDMN [15] -
KDMN-Ensemble [15] -
Out of the Box (OB) [22]
Mucko (ours)
Table 5: State-of-the-art comparison on Visual7W+KB dataset.

6.3 Experimental results on OK-VQA

We also report the performance on the challenging OK-VQA dataset in Table 6. We compare our model with three kinds of existing models: current state-of-the-art VQA models, knowledge-based VQA models and ensemble models. The VQA models are Q-Only [21], MLP [21], BAN [13] and MUTAN [5]. The knowledge-based VQA models [21] consist of ArticleNet (AN), BAN+AN and MUTAN+AN. The ensemble models [21], i.e. BAN/AN oracle and MUTAN/AN oracle, simply take the raw ArticleNet and VQA model predictions and pick the best answer (compared to the ground truth) from either.

Our model consistently outperforms all the compared models on the overall performance. Even the state-of-the-art models specifically designed for VQA tasks (BAN and MUTAN) obtain inferior results compared with ours. This indicates that a general VQA task like OK-VQA cannot be solved simply by a well-designed model, but requires the ability to incorporate external knowledge in an effective way. Moreover, our model outperforms knowledge-based VQA models, including both single models (BAN+AN and MUTAN+AN) and ensemble models (BAN/AN oracle and MUTAN/AN oracle), which further demonstrates the advantages of our proposed multi-layer heterogeneous graph representation and cross-modal heterogeneous graph reasoning.

Method Overall Accuracy
top-1 top-3
Q-Only [21] 14.93 -
MLP [21] 20.67 -
BAN [13] 25.17 -
MUTAN [5] 26.41 -
ArticleNet (AN) [21] 5.28 -
BAN + AN [21] 25.61 -
MUTAN + AN [21] 27.84 -
BAN/AN oracle [21] 27.59 -
MUTAN/AN oracle [21] 28.47 -
Mucko (ours)
Table 6: State-of-the-art comparison on OK-VQA dataset.


  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, pp. 382–398. Cited by: §3.1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6319–6328. Cited by: §2.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In ICCV, pp. 2425–2433. Cited by: §1.
  • [4] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §2.
  • [5] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome (2017) MUTAN: multimodal tucker fusion for visual question answering. In ICCV, pp. 2612–2620. Cited by: Table 6.
  • [6] H. Ben-Younes, R. Cadene, N. Thome, and M. Cord (2019) BLOCK: bilinear superdiagonal fusion for visual question answering and visual relationship detection. In AAAI, pp. 8102–8109. Cited by: §1.
  • [7] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome (2019) MUREL: multimodal relational reasoning for visual question answering. In CVPR, pp. 1989–1998. Cited by: §1.
  • [8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, pp. 1263–1272. Cited by: §3.2.
  • [9] L. Hu, T. Yang, C. Shi, H. Ji, and X. Li (2019) Heterogeneous graph attention networks for semi-supervised short text classification. In EMNLP, pp. 4823–4832. Cited by: §2.
  • [10] R. Hu, A. Rohrbach, T. Darrell, and K. Saenko (2019) Language-conditioned graph networks for relational reasoning. In ICCV, pp. 10294–10303. Cited by: §2.
  • [11] X. Jiang, J. Yu, Z. Qin, Y. Zhuang, X. Zhang, Y. Hu, and Q. Wu (2020) DualVD: an adaptive dual encoding model for deep visual understanding in visual dialogue. In AAAI, Cited by: §2.
  • [12] J. Johnson, A. Karpathy, and L. Fei-Fei (2016) DenseCap: fully convolutional localization networks for dense captioning. In CVPR, pp. 4565–4574. Cited by: §3.1.
  • [13] J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In NeurIPS, pp. 1564–1574. Cited by: §6.3, Table 6.
  • [14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73. Cited by: §6.1.
  • [15] G. Li, H. Su, and W. Zhu (2017) Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. arXiv preprint arXiv:1712.00733. Cited by: §6.1, §6.2, §6.2, Table 5, §6.
  • [16] H. Li, P. Wang, C. Shen, and A. v. d. Hengel (2019) Visual question answering as reading comprehension. In CVPR, pp. 6319–6328. Cited by: §4.1.
  • [17] L. Li, Z. Gan, Y. Cheng, and J. Liu (2019) Relation-aware graph attention network for visual question answering. In ICCV, pp. 10313–10322. Cited by: §1, §2.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §6.1.
  • [19] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NeurIPS, pp. 289–297. Cited by: §2.
  • [20] M. Malinowski, M. Rohrbach, and M. Fritz (2015) Ask your neurons: a neural-based approach to answering questions about images. In ICCV, pp. 1–9. Cited by: §2.
  • [21] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) OK-VQA: a visual question answering benchmark requiring external knowledge. In CVPR, pp. 3195–3204. Cited by: §6.1, §6.3, Table 6, §6.
  • [22] M. Narasimhan, S. Lazebnik, and A. Schwing (2018) Out of the box: reasoning with graph convolution nets for factual visual question answering. In NeurIPS, pp. 2654–2665. Cited by: §1, §2, §3.1, §4.1, §6.2, Table 5.
  • [23] M. Narasimhan and A. G. Schwing (2018) Straight to the facts: learning knowledge base retrieval for factual visual question answering. In ECCV, pp. 451–468. Cited by: §4.1.
  • [24] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §3.1.
  • [25] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39 (6), pp. 1137–1149. Cited by: §3.1.
  • [26] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607. Cited by: §2.
  • [27] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel (2019) Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In CVPR, pp. 1960–1968. Cited by: §2.
  • [28] P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel (2017) Explicit knowledge-based reasoning for visual question answering. In IJCAI, pp. 1290–1296. Cited by: §1, §2.
  • [29] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel (2018) FVQA: fact-based visual question answering. TPAMI 40 (10), pp. 2413–2427. Cited by: §1, §2, §4, §4, §4.1.
  • [30] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu (2019) Heterogeneous graph attention network. In WWW, pp. 2022–2032. Cited by: §2.
  • [31] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
  • [32] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In CVPR, pp. 21–29. Cited by: §2.
  • [33] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7W: grounded question answering in images. In CVPR, pp. 4995–5004. Cited by: §6.1.