Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

12/03/2017 ∙ by Guohao Li, et al. ∙ Tsinghua University

Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework which endows the model with the capability to answer more complex questions by leveraging massive external knowledge with dynamic memory networks. Specifically, the questions along with the corresponding images trigger a process to retrieve the relevant information in external knowledge bases, which is embedded into a continuous vector space by preserving the entity-relation structures. Afterwards, we employ dynamic memory networks to attend to the large body of facts in the knowledge graph and images, and then perform reasoning over these facts to generate corresponding answers. Extensive experiments demonstrate that our model not only achieves state-of-the-art performance in the visual question answering task, but can also answer open-domain questions effectively by leveraging the external knowledge.




1 Introduction

Visual Question Answering (VQA) is a ladder towards a better understanding of the visual world, which pushes forward the boundaries of both computer vision and natural language processing. A VQA system is given a text-based question about an image and is expected to generate a correct answer to that question. In general, VQA is a kind of Visual Turing Test, which rigorously assesses whether a system is able to achieve human-level semantic analysis of images [10, 13]. A system could solve most of the tasks in computer vision if it performs as well as or better than humans in VQA. As a result, VQA has garnered increasing attention due to its numerous potential applications [2], such as providing a more natural way to improve human-computer interaction and enabling visually impaired individuals to get information about images.

Figure 1: A real case of open-domain visual question answering based on the internal representation of an image and external knowledge. The recent success of deep learning provides a good opportunity to implement closed-domain VQA, but it is incapable of answering open-domain questions when external knowledge is needed. In this example, the system should recognize the giraffes and then query the knowledge bases for the main diet of giraffes. In this paper, we propose to explore the external knowledge along with the image representation based on a dynamic memory network, which allows multi-hop reasoning over several facts.

To fulfill VQA tasks, a responder must understand the intention of the question, reason over visual elements of the image, and sometimes have general knowledge about the world. Most present methods solve VQA by jointly learning interactions and performing inference over the question and image contents based on the recent success of deep learning [19, 2, 22, 9, 8], which can be further improved by introducing attention mechanisms [34, 32, 31, 17, 33, 1]. However, most questions in the current VQA datasets are quite simple and are answerable by analyzing the question and image alone [2, 28]. It remains debatable whether such systems can answer questions that require prior knowledge ranging from common sense to subject-specific and even expert-level knowledge. It is attractive to develop methods that are capable of deeper image understanding by answering open-domain questions [28], which requires the system to have mechanisms for connecting VQA with structured knowledge, as shown in Fig. 1. Some efforts have been made in this direction, but most of them can only handle a limited number of predefined types of questions [26, 25].

Different from the text-based QA problem, it is difficult to conduct open-domain VQA purely through knowledge-based reasoning, since any description of an image in structured forms is inevitably incomplete [15]. The recent availability of large training datasets [28] makes it feasible to train a complex model in an end-to-end fashion by leveraging the recent advances in deep neural networks (DNN) [2, 9, 34, 17, 1]. Nevertheless, it is non-trivial to integrate knowledge into DNN-based methods, since the knowledge is usually represented in a symbol-based or graph-based manner (e.g., Freebase [5], DBPedia [3]), which is intrinsically different from the DNN-based features. A few attempts have been made in this direction [29], but they may involve much irrelevant information and fail to implement multi-hop reasoning over several facts.

Memory networks [27, 24, 16] offer an opportunity to address these challenges by reading from and writing to an external memory module, which is modeled by the actions of neural networks. Recently, they have demonstrated state-of-the-art performance in numerous NLP applications, including reading comprehension [20] and textual question answering [6, 16]. Some seminal efforts have also been made to implement VQA based on dynamic memory networks [30], but they do not involve a mechanism to incorporate external knowledge, making them incapable of answering open-domain visual questions. Nevertheless, these attractive characteristics motivate us to leverage the memory structures to encode the large-scale structured knowledge and fuse it with the image features, which offers an approach to answering open-domain visual questions.

Figure 2: Overall architecture of our proposed KDMN network. Given an image and the corresponding questions, the visual objects of the input image and the keywords of the corresponding questions are extracted using Fast-RCNN and syntax analysis, respectively. Afterwards, we assess the importance of entities in the knowledge graph and retrieve the most informative context-relevant knowledge triples, which are fed to the memory network after embedding the candidate knowledge into a continuous feature space. Consequently, we integrate the representations of images and extracted knowledge into a common space, and store the features in a dynamic memory module. Open-domain VQA can then be implemented by interpreting the joint representation under an attention mechanism.

1.1 Our Proposal

To address the aforementioned issues, we propose a novel Knowledge-incorporated Dynamic Memory Network framework (KDMN), which introduces massive external knowledge to answer open-domain visual questions by exploiting the dynamic memory network. It endows a system with the capability to answer a broad class of open-domain questions by reasoning over the image content together with the massive knowledge, which is conducted by the memory structures.

Different from most existing techniques, which answer visual questions based solely on the image content, we propose to address a more challenging scenario which requires reasoning beyond the image content. The DNN-based approaches [2, 9, 34] are therefore not sufficient, since they can only capture information present in the training images. Recent advances include several attempts to link knowledge to VQA methods [26, 25], which make use of structured knowledge graphs and reason about an image on the supporting facts. Most of these algorithms first extract the visual concepts from a given image, and then implement reasoning over the structured knowledge bases explicitly. However, it is non-trivial to extract sufficient visual attributes, since an image lacks the structure and grammatical rules of language. To address this issue, we propose to retrieve a batch of candidate knowledge corresponding to the given image and related questions, and feed it to the deep neural network implicitly. The proposed approach provides a general pipeline that simultaneously preserves the advantages of DNN-based approaches [2, 9, 34] and knowledge-based techniques [26, 25].

In general, the underlying symbolic nature of a Knowledge Graph (KG) makes it difficult to integrate with DNNs. The usual knowledge graph embedding models such as TransE [7] focus on link prediction, which is different from the VQA task, where the aim is to fuse knowledge with other features. To tackle this issue, we propose to embed the entities and relations of a KG into a continuous vector space, such that the factual knowledge can be used more simply. Each knowledge triple is treated as a three-word SVO phrase, and embedded into a feature space by feeding its word embeddings through an RNN architecture. In this case, the proposed knowledge embedding feature shares a common space with other textual elements (questions and answers), which provides the additional advantage of integrating them more easily.

Once the massive external knowledge is integrated into the model, it is imperative to provide a flexible mechanism to store a richer representation. The memory network, which contains scalable memory with a learning component to read from and write to it, allows complex reasoning by modeling interaction between multiple parts of the data [27, 30]. In this paper, we adopt the most recent advance of Improved Dynamic Memory Networks (DMN+) [30] to implement the complex reasoning over several facts. Our model provides a mechanism to attend to candidate knowledge embeddings in an iterative manner, and fuse them with the multi-modal data including image, text and knowledge triples in the memory component. The memory vector therefore memorizes useful knowledge to facilitate the prediction of the final answer. Compared with DMN+ [30], we introduce the external knowledge into the memory network, and thereby endow the system with the ability to answer open-domain questions.

To summarize, our framework is capable of conducting the multi-modal data reasoning including the image content and external knowledge, such that the system is endowed with a more general capability of image interpretation. Our main contributions are as follows:

  • To the best of our knowledge, this is the first attempt to integrate external knowledge and image representation with a memory mechanism, such that open-domain visual question answering can be conducted effectively with the massive knowledge appropriately harnessed;

  • We propose a novel structure-preserved method to embed the knowledge triples into a common space with other textual data, making it flexible to integrate different modalities of data in an implicit manner such as image, text and knowledge triples;

  • We propose to exploit the dynamic memory network to implement multi-hop reasoning, which is able to automatically retrieve the relevant information in the knowledge bases and infer the most probable answers accordingly.

2 Overview

In this section, we outline our model to implement the open-domain visual question answering. In order to conduct the task, we propose to incorporate the image content and external knowledge by exploiting the most recent advance of dynamic memory network [16, 30], yielding three main modules in Fig. 2. The system is therefore endowed with an ability to answer arbitrary questions corresponding to a specific image.

Considering that most existing VQA datasets include only a minority of questions that require prior knowledge, performance on them cannot reflect the capability of answering open-domain questions. We therefore automatically produce a collection of more challenging question-answer pairs, which require complex reasoning beyond the image contents by incorporating the external knowledge. We hope that it can serve as a benchmark for evaluating the capability of various VQA models in open-domain scenarios.

Given an image, we apply Fast-RCNN [11] to detect the visual objects of the input image, and extract keywords of the corresponding questions with syntax analysis. Based on this information, we propose to learn a mechanism to retrieve the candidate knowledge by querying the large-scale knowledge graph, yielding a subgraph of relevant knowledge to facilitate the question answering. During the past years, a substantial number of large-scale knowledge bases have been developed, which store common-sense and factual knowledge in a machine-readable fashion. In general, each piece of structured knowledge is represented as a triple (subject, rel, object), with subject and object being two entities or concepts, and rel corresponding to the specific relationship between them. In this paper, we adopt external knowledge mined from ConceptNet [23], an open multilingual knowledge graph containing common-sense relationships between daily words, to aid the reasoning of open-domain VQA.

Our VQA model provides a novel mechanism to integrate image information with that extracted from ConceptNet within a dynamic memory network. In general, it is non-trivial to integrate the structured knowledge with the DNN features due to their different modalities. To address this issue, we embed the entities and relations of the subgraph into a continuous vector space, which preserves the inherent structure of the KGs. The feature embedding makes it convenient to fuse the knowledge with the image representation in a dynamic memory network, which builds on the attention mechanism and the memory update mechanism. The attention mechanism is responsible for producing the contextual vector, with relevance inferred from the question and the previous memory status. The memory update mechanism then renews the memory status based on the contextual vector, which can memorize useful information for predicting the final answer. The novelty lies in the fact that these disparate forms of information are embedded into a common space based on the memory network, which facilitates the subsequent answer reasoning.

Finally, we generate a predicted answer by reasoning over the facts in the memory along with the image contents. In this paper, we focus on the multi-choice setting, where several candidate answers are provided along with a question and a corresponding image. For each question, we treat every candidate answer as input, and predict whether the image-question-answer triplet is correct. The proposed model chooses the candidate answer with the highest probability, and is trained by minimizing the cross-entropy error on the answers through the entire network.

3 Answer Open-Domain Visual Questions

In this section, we elaborate on the details and formulations of our proposed model for answering open-domain visual questions. We first retrieve an appropriate amount of candidate knowledge from the large-scale ConceptNet by analyzing the image content and the corresponding questions; afterward, we propose a novel framework based on the dynamic memory network to embed these symbolic knowledge triples into a continuous vector space and store them in a memory bank; finally, we exploit this information to implement open-domain VQA by fusing the knowledge with the image representation.

3.1 Candidate Knowledge Retrieval

In order to answer open-domain visual questions, we sometimes need to access information not present in the image by retrieving the candidate knowledge in the KBs. A desirable knowledge retrieval should include most of the useful information while ignoring the irrelevant information, which is essential to avoid misleading the model and to reduce the computation cost. To this end, we take the following three principles into consideration: (1) entities that appear in images and questions (key entities) are critical; (2) the importance of entities that have direct or indirect links to key entities decays as the number of link hops increases; (3) edges between these entities are potentially useful knowledge.

Following these principles, we propose a three-step procedure to retrieve the candidate knowledge that is relevant to the context of images and questions. The retrieval procedure pays more attention to graph nodes that are linked to semantic entities, and also takes the graph structure into account when measuring edge importance.

In order to retrieve the most informative knowledge, we first extract the candidate nodes in ConceptNet by analyzing the prominent visual objects in images with Fast-RCNN [11], and textual keywords with the Natural Language Toolkit [4]. Both are then associated with the corresponding semantic entities in ConceptNet [23] by matching all possible n-grams of words. Afterwards, we retrieve the first-order subgraph using these selected nodes from ConceptNet [23], which includes all edges connecting with at least one candidate node. It is assumed that the resultant subgraph contains the most relevant information, which is sufficient to answer questions while reducing the redundancy. The resultant first-order knowledge subgraph is denoted as $G$.
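The n-gram matching step can be illustrated with a short sketch. It assumes that entity names are available as a set of strings and that multi-word entities are joined with underscores, as in ConceptNet term URIs; the function names, the toy vocabulary, and the limit of three words per n-gram are hypothetical choices, not details given in the paper.

```python
def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


def match_entities(tokens, entity_vocab, max_n=3):
    """Return the entities whose name equals some n-gram of the tokens.

    entity_vocab is a hypothetical set of entity names; multi-word
    entities are written with underscores (e.g. "traffic_light").
    """
    found = []
    for gram in ngrams(tokens, max_n):
        name = "_".join(gram)
        if name in entity_vocab and name not in found:
            found.append(name)
    return found


# Toy vocabulary and question, for illustration only.
vocab = {"traffic_light", "light", "red", "giraffe"}
question = "why is the traffic light red".split()
matched = match_entities(question, vocab)
```

Matching every n-gram rather than single words lets a multi-word concept such as "traffic light" be linked as one entity, in addition to its component words.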

Finally, we compress the subgraph $G$ by evaluating and ranking the importance of edges in $G$ using a designed score function, and carefully select the top-$N$ edges along with their nodes for the subsequent task. Specifically, we first assign an initial weight $w_i$ to each subgraph node $i$; e.g., the initial weights of visual objects can be proportional to their bounding-box areas so that the dominant objects receive more attention, while the textual keywords are treated equally. Then, we calculate the importance score of each node in $G$ by traversing each edge and propagating node weights to their neighbors with a decay factor $r \in (0, 1)$ as

$$\mathrm{score}(i) = w_i + \sum_{j \in G \setminus i} r^{n}\, w_j,$$

where $n$ is the number of link hops between the entities $i$ and $j$. For simplicity, we ignore the edge direction and edge type (relation type), and define the importance of edge $w_{i,j}$ as the sum of the scores of its two connected nodes as

$$w_{i,j} = \mathrm{score}(i) + \mathrm{score}(j), \quad \forall (i, j) \in G.$$

In this paper, we take the top-$N$ edges ranked by $w_{i,j}$ as the final candidate knowledge for the given context, denoted as $G^*$.
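The scoring and ranking procedure described above can be sketched as follows. This is a minimal sketch, assuming that a node's score is its own initial weight plus the weights of all other reachable nodes discounted by the decay factor raised to the hop distance, and that an edge's importance is the sum of its two endpoint scores. The function names, the toy graph, and the decay value of 0.5 are illustrative assumptions.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def rank_edges(edges, init_weight, decay=0.5, top_n=2):
    """Score nodes by decayed weight propagation, then keep the top edges."""
    adj = {}
    for a, b in edges:  # edge direction and relation type are ignored
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    score = {}
    for i in adj:
        dist = hop_distances(adj, i)
        # The n = 0 term is the node's own weight; others decay with distance.
        score[i] = sum(init_weight.get(j, 0.0) * decay ** n
                       for j, n in dist.items())
    ranked = sorted(edges, key=lambda e: score[e[0]] + score[e[1]],
                    reverse=True)
    return score, ranked[:top_n]


# Toy subgraph: only the detected object carries an initial weight.
edges = [("giraffe", "leaf"), ("giraffe", "africa"), ("leaf", "tree")]
weights = {"giraffe": 1.0}
score, kept = rank_edges(edges, weights, decay=0.5, top_n=2)
```

In this toy graph the edge between "leaf" and "tree" is dropped because both endpoints are further from the key entity, which mirrors the principle that importance decays with the number of link hops.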

3.2 Knowledge Embedding in Memories

The candidate knowledge that we have extracted is represented in a symbolic triplet format, which is intrinsically incompatible with DNNs. This fact urges us to embed the entities and relation of knowledge triples into a continuous vector space. Moreover, we regard each entity-relation-entity triple as one knowledge unit, since each triple naturally represents one piece of fact. The knowledge units can be stored in memory slots for reading and writing, and distilled through an attention mechanism for the subsequent tasks.

In order to embed the symbolic knowledge triples into memory vector slots, we treat the entities and relations as words, and map them into a continuous vector space using word embedding [21]. Afterwards, the embedded knowledge is encoded into a fixed-size vector by feeding it to a recurrent neural network (RNN). Specifically, we initialize the word-embedding matrix with a pre-trained GloVe word-embedding [21], and refine it simultaneously with the rest of the procedure of question and candidate answer embedding. In this case, the entities and relations share a common embedding space with other textual elements (questions and answers), which makes them much more flexible to fuse later.

Afterwards, the knowledge triples are treated as SVO phrases of (subject, verb, object), and fed to a standard two-layer stacked LSTM as

$$C^{(t)}_i = \mathrm{LSTM}\left(L[w^t_i],\, C^{(t-1)}_i\right), \quad t \in \{1, 2, 3\},\ i = 1, \dots, N,$$

where $w^t_i$ is the $t$-th word of the $i$-th SVO phrase, $(w^1_i, w^2_i, w^3_i) \in G^*$, $L$ is the word embedding matrix [21], and $C_i$ is the internal state of the LSTM cell when forwarding the $i$-th SVO phrase. The rationale lies in the fact that the LSTM can capture the semantic meanings effectively when the knowledge triples are treated as SVO phrases.

For each question-answering context, we take the LSTM internal states of the relevant knowledge triples as memory vectors, yielding the embedded knowledge stored in memory slots as

$$M = \left[C^{(3)}_i\right], \quad i = 1, \dots, N,$$

where $M_i$ is the $i$-th memory slot corresponding to the $i$-th knowledge triple, which can be used for further answer inference. Note that the method is different from the usual knowledge graph embedding models, since our model aims to fuse knowledge with the latent features of images and text, whereas alternative models such as TransE [7] focus on the link prediction task.

3.3 Attention-based Knowledge Fusion with DNNs

We have stored $N$ relevant knowledge embeddings in memory slots for a given question-answer context, which allows the model to incorporate massive knowledge when $N$ is large. The external knowledge overwhelms other contextual information in quantity, making it imperative to distill the useful information from the candidate knowledge. The Dynamic Memory Network (DMN) [16, 30] provides a mechanism to address the problem by modeling interactions among multiple data channels. In the DMN module, an episodic memory vector is formed and updated during an iterative attention process, which memorizes the most useful information for question answering. Moreover, the iterative process brings a potential capability of multi-hop reasoning.

This DMN consists of an attention component which generates a contextual vector using the previous memory vector, and an episodic memory updating component which updates itself based on the contextual vector. Specifically, we propose a novel method to generate the query vector $q$ by feeding visual and textual features to a non-linear fully-connected layer to capture question-answer context information as

$$q = \tanh\left(W_1 \left[f^{(I)}; f^{(Q)}; f^{(A)}\right] + b_1\right),$$

where $W_1$ and $b_1$ are the weight matrix and bias vector, respectively; and $f^{(I)}$, $f^{(Q)}$ and $f^{(A)}$ denote the DNN features corresponding to the images, questions and multi-choice answers, respectively. The query vector $q$ captures information from the question-answer context.

During the training process, the query vector $q$ initializes an episodic memory vector $m^{(0)}$ as $m^{(0)} = q$. An iterative attention process is then triggered, which gradually refines the episodic memory $m$ until the maximum number of iteration steps $T$ is reached. By the $T$-th iteration, the episodic memory $m^{(T)}$ will memorize useful visual and external information to answer the question.

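Attention component. At the $t$-th iteration, we concatenate each knowledge embedding $M_i$ with the episodic memory of the last iteration $m^{(t-1)}$ and the query vector $q$, then apply the basic soft attention procedure to obtain the context vector $c^{(t)}$ as

$$z^{(t)}_i = \left[M_i; m^{(t-1)}; q\right],$$
$$\alpha^{(t)} = \mathrm{softmax}\left(w \tanh\left(W_2 z^{(t)}_i + b_2\right)\right),$$
$$c^{(t)} = \sum_{i=1}^{N} \alpha^{(t)}_i M_i, \quad t = 1, \dots, T,$$

where $z^{(t)}_i$ is the concatenated vector for the $i$-th candidate memory at the $t$-th iteration; $\alpha^{(t)}_i$ is the $i$-th element of $\alpha^{(t)}$ representing the normalized attention weight for $M_i$ at the $t$-th iteration; and $w$, $W_2$ and $b_2$ are parameters to be optimized in deep neural networks.

Hereby, we obtain the contextual vector $c^{(t)}$, which captures useful external knowledge for updating the episodic memory $m^{(t-1)}$ and providing the supporting facts to answer the open-domain questions.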
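One attention pass of this kind can be sketched in plain Python. The sketch assumes a single hidden tanh layer followed by a linear scoring vector and a softmax over the memory slots; the parameter names W, b and w, and the toy values, are hypothetical stand-ins for learned parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(slots, memory, query, W, b, w):
    """One soft-attention pass over the memory slots.

    slots: list of slot vectors; memory and query: vectors (lists).
    W (list of rows), b and w are hypothetical learned parameters.
    Returns (attention weights, context vector).
    """
    logits = []
    for slot in slots:
        z = slot + memory + query  # concatenate [slot; memory; query]
        hidden = [math.tanh(sum(wi * zi for wi, zi in zip(row, z)) + bi)
                  for row, bi in zip(W, b)]
        logits.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    alpha = softmax(logits)
    dim = len(slots[0])
    # The context vector is the attention-weighted sum of the slots.
    context = [sum(a * slot[d] for a, slot in zip(alpha, slots))
               for d in range(dim)]
    return alpha, context


# Toy example: two 2-d slots, 1-d memory and query, one hidden unit.
slots = [[1.0, 0.0], [0.0, 1.0]]
alpha, context = attend(slots, memory=[0.0], query=[0.0],
                        W=[[1.0, -1.0, 0.0, 0.0]], b=[0.0], w=[1.0])
```

Because the attention weights sum to one, the context vector stays in the span of the stored knowledge embeddings, with more weight on the slots judged relevant to the current memory and query.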

Episodic memory updating component. We apply the memory update mechanism [24, 30] as

$$m^{(t)} = \mathrm{ReLU}\left(W_3 \left[m^{(t-1)}; c^{(t)}; q\right] + b_3\right),$$

where $W_3$ and $b_3$ are parameters to be optimized. After the $T$-th iteration, the episodic memory $m^{(T)}$ memorizes useful knowledge information to answer the open-domain question.
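The iterative update can be sketched as below. It assumes the ReLU update over the concatenation of the previous memory, the context vector and the query, as used in DMN+ [30]; the context vectors are passed in directly so that the example stays self-contained, and all parameter values are hypothetical.

```python
def relu(x):
    """Rectified linear unit for a single float."""
    return x if x > 0.0 else 0.0


def update_memory(prev_memory, context, query, W, b):
    """Return the new memory: ReLU(W [prev_memory; context; query] + b)."""
    z = prev_memory + context + query  # list concatenation
    return [relu(sum(wi * zi for wi, zi in zip(row, z)) + bi)
            for row, bi in zip(W, b)]


# Toy run with 1-d vectors and two iterations.
query = [0.5]
memory = list(query)           # the memory is initialised with the query
contexts = [[1.0], [-3.0]]     # hypothetical context vectors per iteration
W, b = [[1.0, 1.0, -1.0]], [0.0]
history = []
for context in contexts:
    memory = update_memory(memory, context, query, W, b)
    history.append(memory)
```

Feeding the query into every update keeps the question-answer context available at each hop, while the memory accumulates what earlier hops retrieved.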

Compared with the DMN+ model implemented in [30], we allow the dynamic memory network to incorporate massive external knowledge into the procedure of VQA reasoning. It endows the system with the capability to answer more general visual questions that are relevant to but go beyond the image contents, which is more attractive in practical applications.

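Fusion with episodic memory and inference. Finally, we embed the visual features $f^{(I)}$ along with the textual features $f^{(Q)}$ and $f^{(A)}$ into a common space, and fuse them together using the Hadamard product (element-wise multiplication) as

$$e^{(k)} = \tanh\left(W^{(k)} f^{(k)} + b^{(k)}\right), \quad k \in \{I, Q, A\},$$
$$h = e^{(I)} \odot e^{(Q)} \odot e^{(A)},$$

where $e^{(I)}$, $e^{(Q)}$ and $e^{(A)}$ are embedded features for image, question and answer, respectively; $h$ is the fused feature in this common space; and $W^{(I)}$, $W^{(Q)}$ and $W^{(A)}$ are the corresponding parameters in the neural networks.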

The final episodic memory $m^{(T)}$ is concatenated with the fused feature $h$ to predict the probability of whether the multi-choice candidate answer is correct as

$$ans^* = \arg\max_{ans}\ \mathrm{softmax}\left(W_4 \left[h_{ans}; m^{(T)}_{ans}\right] + b_4\right),$$

where $ans$ represents the index of the multi-choice candidate answers; the supporting knowledge triples are stored in $m^{(T)}_{ans}$; and $W_4$ and $b_4$ are the parameters to be optimized in the DNNs. The final choice is consequently obtained once we have $ans^*$.
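The fusion and selection step can be sketched as follows. The sketch assumes that each candidate answer has its own fused feature and episodic memory, that a single linear layer scores their concatenation, and that a softmax across candidates gives the probabilities; the function names and all numbers are hypothetical.

```python
import math


def hadamard(*vectors):
    """Element-wise product of equally sized vectors."""
    out = [1.0] * len(vectors[0])
    for vec in vectors:
        out = [o * v for o, v in zip(out, vec)]
    return out


def pick_answer(candidates, w, b):
    """Score each (fused_feature, episodic_memory) pair and pick the best.

    w and b are hypothetical parameters of one linear scoring layer.
    Returns (index of the chosen candidate, probabilities).
    """
    logits = [sum(wi * xi for wi, xi in zip(w, fused + memory)) + b
              for fused, memory in candidates]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy embedded features for image, question and answer.
fused = hadamard([1.0, 2.0], [0.5, 0.5], [2.0, 1.0])
candidates = [(fused, [0.0]), (fused, [2.0])]
best, probs = pick_answer(candidates, w=[0.1, 0.1, 1.0], b=0.0)
```

In the toy run the two candidates share the same fused feature, so the choice is decided entirely by the episodic memory, which is where the retrieved knowledge enters the prediction.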

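Our training objective is to learn parameters based on a cross-entropy loss function as

$$\theta^* = \arg\min_{\theta} \left(-\frac{1}{D} \sum_{i=1}^{D} y^{(i)} \log P\left(A^{(i)} \mid I^{(i)}, Q^{(i)}, K^{(i)}; \theta\right)\right),$$

where $P(A^{(i)} \mid I^{(i)}, Q^{(i)}, K^{(i)}; \theta)$ represents the probability of predicting the answer $A^{(i)}$, given the image $I^{(i)}$, question $Q^{(i)}$ and external knowledge $K^{(i)}$; $\theta$ represents the model parameters; $D$ is the number of training samples; and $y^{(i)}$ is the label for the $i$-th sample. The model can be trained in an end-to-end manner once the candidate knowledge triples are retrieved from the original knowledge graph.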
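For intuition, a cross-entropy objective over image-question-answer triplets can be computed as in the sketch below. It assumes a binary label per triplet and uses the standard binary form, in which incorrect triplets also contribute through the complementary term; that choice, the function name, and the clipping constant are assumptions rather than details stated in the paper.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy.

    probs: predicted probability that each triplet is correct.
    labels: 1 for a correct triplet, 0 otherwise.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# An uninformative model (p = 0.5 everywhere) has loss ln 2.
loss_uninformed = cross_entropy([0.5, 0.5], [1, 0])
# Confident and correct predictions drive the loss towards zero.
loss_confident = cross_entropy([0.99, 0.01], [1, 0])
```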

4 Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed model, and compare it with its variants and the alternative methods. We implement the evaluation on a public benchmark dataset (Visual7W) [34] for the closed-domain VQA task, and also automatically generate numerous question-answer pairs to evaluate the performance on open-domain VQA. We first briefly review the dataset and the implementation details, and then report the performance of our proposed method compared with several baseline models on both closed-domain and open-domain VQA tasks.

4.1 Datasets

We train and evaluate our model on a publicly available large-scale visual question answering dataset, the Visual7W dataset [34], due to the diversity of its question types. Besides, since there is currently no publicly available open-domain VQA dataset for evaluation, we automatically build a collection of open-domain visual question-answer pairs to examine the potential of our model for answering open-domain visual questions.

4.1.1 Visual7W Dataset

The Visual7W dataset [34] is built on a subset of images from Visual Genome [14], and includes questions in terms of (what, where, when, who, why, which and how) along with the corresponding answers in a multi-choice format. Similar to [34], we divide the dataset into training, validation and test subsets, with a total of 327,939 question-answer pairs on 47,300 images. Compared with the alternative datasets, Visual7W has diverse types of question-answer pairs and image content [28], which provides more opportunities to assess the human-level capability of a system on open-domain VQA.

4.1.2 Open-domain Question Generation

In this paper, we automatically generate numerous question-answer pairs by considering the image content and relevant background knowledge, which provides a test bed for the evaluation of a more realistic VQA task. Specifically, we generate a collection automatically based on the test images in Visual7W by filling a set of question-answer templates, which means that the information is not present during the training stage. To make the task more challenging, we selectively sample the question-answer pairs that require reasoning on both the visual concepts in the image and the external knowledge, making it resemble the scenario of open-domain visual question answering. In this paper, we generate 16,850 open-domain question-answer pairs on images in the Visual7W test split. More details on the QA generation and relevant information can be found in the supplementary material.

Figure 3: Example results on the Visual7W dataset for (closed-domain) VQA tasks. Given an image and the corresponding question, we report the corresponding answers obtained via our algorithm. Specifically, pr denotes the predicted probability generated by our model, and pr-NoKG is the predicted probability of the ablative model KDMN-NoKG. We make the predicted choices bold accordingly. The external knowledge triples are also provided if they are retrieved automatically by our method to support the joint reasoning. As observed, the external knowledge is essential even for the conventional VQA tasks, e.g., in the 5th example, it is much easier to infer the place by incorporating external knowledge once a giraffe is recognized.

4.2 Implementation Details

In our experiments, we fix the joint-embedding common space dimension as 1024, the word-embedding dimension as 300 and the dimension of LSTM internal states as 512. We use a pre-trained ResNet-101 [12] model to extract image features, and select 20 candidate knowledge triples for each QA pair throughout the experiments. Empirical study demonstrates that this is sufficient in our task, although more knowledge triples are also allowed. The iteration number of a dynamic memory network update is set to 2, and the dimension of episodic memory is set to 2048, which is equal to the dimension of memory slots.

In this paper, we combine the candidate Question-Answer pair to generate a hypothesis, and formulate the multi-choice VQA problem as a classification task. The correct answer can be determined by choosing the one with the largest probability. In each iteration, we randomly sample a batch of 500 QA pairs, and apply stochastic gradient descent algorithm with a base learning rate of 0.0001 to tune the model parameters. The candidate knowledge is first retrieved, and other modules are trained in an end-to-end manner.
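The settings reported in this section can be collected into a single configuration, sketched below. The key names are hypothetical; the values are those stated in the text. The final check is one possible reading, offered as an assumption, of how a 512-dimensional LSTM state yields 2048-dimensional memory slots: concatenating the hidden and cell states of both layers of the two-layer stacked LSTM.

```python
# Hyper-parameters as reported in the text; key names are illustrative.
CONFIG = {
    "joint_embedding_dim": 1024,
    "word_embedding_dim": 300,
    "lstm_state_dim": 512,
    "lstm_layers": 2,
    "image_encoder": "ResNet-101",
    "num_candidate_triples": 20,
    "memory_iterations": 2,
    "episodic_memory_dim": 2048,
    "batch_size": 500,
    "base_learning_rate": 1e-4,
}

# Assumption: a memory slot concatenates the hidden state and the cell
# state (2 vectors) of every LSTM layer.
slot_dim = CONFIG["lstm_layers"] * 2 * CONFIG["lstm_state_dim"]
dims_agree = slot_dim == CONFIG["episodic_memory_dim"]
```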

4.2.1 Comparison Methods

In order to analyze the contributions of each component in our knowledge-enhanced, memory-based model, we ablate our full model as follows:

  • KDMN-NoKG: baseline version of our model. No external knowledge involved in this model. Other parameters are set the same as full model.

  • KDMN-NoMem: a version without memory network. External knowledge triples are used by one-pass soft attention.

  • KDMN: our full model. External knowledge triples are incorporated in Dynamic Memory Network.

We also compare our method with several alternative VQA methods including (1) LSTM-Att [34]: an LSTM model with spatial attention; (2) MemAUG [18]: a memory-augmented model for VQA; (3) MCB+Att [8]: a model combining multi-modal features by Multimodal Compact Bilinear pooling; (4) MLAN [33]: an advanced multi-level attention model.

Figure 4: Example results of open-domain visual question answering based on our proposed knowledge-incorporated dynamic memory network. Given an image, we automatically generate the open-domain question-answer pair by considering the image content and the relevant background knowledge. We report the corresponding answers obtained via our algorithm. Specifically, pr denotes the predicted probability generated by our model, and pr-NoKG is the predicted probability of the ablative model KDMN-NoKG. The results demonstrate that external knowledge plays an essential role in answering open-domain questions. Without it, a system is incapable of inferring the food in the 1st example and the prices of items in the 3rd example.

4.3 Results and Analysis

In this section, we report the quantitative evaluation along with representative samples of our method, compared with our ablative models and the state-of-the-art methods for both the conventional (closed-domain) VQA task and open-domain VQA.

4.3.1 VQA Task

In this section, we report the quantitative accuracy in Table 1 along with the sample results in Fig. 3. The overall results demonstrate that our algorithm obtains different boosts compared with the competitors on various kinds of questions, e.g., significant improvements on the Who and What questions, and slight boosts on the When and How questions. After inspecting the success and failure cases, we found that the Who and What questions have larger diversity in questions and multi-choice answers compared to other types, and therefore benefit more from external background knowledge. Note that compared with MemAUG [18], in which a memory mechanism is also adopted, our algorithm still gains a significant improvement, which further confirms our belief that the background knowledge provides critical support.

We further make comprehensive comparisons among our ablative models. To make the comparison fair, all the experiments are implemented on the same basic network structure and share the same hyper-parameters. In general, our KDMN model on average gains 1.6 points over the KDMN-NoMem model and 4.0 points over the KDMN-NoKG model (Table 1), which further implies the effectiveness of dynamic memory networks in exploiting external knowledge. Through iterative attention processes, the episodic memory vector captures background knowledge distilled from external knowledge embeddings. The KDMN-NoMem model gains 2.4 points over the KDMN-NoKG model, which implies that the incorporated external knowledge brings an additional advantage, and acts as supplementary information for predicting the final answer. The indicative examples in Fig. 3 also demonstrate the impact of external knowledge, such as the 4th example of “Why is the light red?”. It would be helpful if we could retrieve the function of the traffic lights from the external knowledge effectively.

Methods What Where When Who Why How Average
LSTM-Att.[34] 51.5 57.0 75.0 59.5 55.5 49.8 54.3
MCB + Att.[8] 60.3 70.4 79.5 69.2 58.2 51.1 62.2
MemAUG [18] 62.2 68.9 76.8 66.4 57.8 52.9 62.8
MLAN [33] 60.5 71.2 79.6 69.4 58.0 50.8 62.4
KDMN-NoKG 59.7 69.6 79.9 68.0 61.6 51.3 62.0
KDMN-NoMem 62.1 71.5 81.1 72.5 62.9 54.0 64.4
KDMN 64.6 73.1 81.3 73.9 64.1 53.3 66.0
Ensemble 67.9 77.0 83.3 77.2 69.0 56.8 69.4
Table 1: Accuracy (%) on the Visual7W dataset.
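The ablation gains discussed above can be recomputed directly from Table 1. The following is a minimal sketch of that arithmetic only; the names `TABLE1` and `gain` are illustrative and are not part of our implementation.

```python
# Per-type accuracies (%) copied from Table 1 for the three ablative variants.
QUESTION_TYPES = ["What", "Where", "When", "Who", "Why", "How", "Average"]
TABLE1 = {
    "KDMN-NoKG":  [59.7, 69.6, 79.9, 68.0, 61.6, 51.3, 62.0],
    "KDMN-NoMem": [62.1, 71.5, 81.1, 72.5, 62.9, 54.0, 64.4],
    "KDMN":       [64.6, 73.1, 81.3, 73.9, 64.1, 53.3, 66.0],
}


def gain(model, baseline):
    """Absolute accuracy gain of `model` over `baseline`, in percentage points."""
    return {
        qtype: round(a - b, 1)
        for qtype, a, b in zip(QUESTION_TYPES, TABLE1[model], TABLE1[baseline])
    }


if __name__ == "__main__":
    pairs = [
        ("KDMN", "KDMN-NoMem"),
        ("KDMN", "KDMN-NoKG"),
        ("KDMN-NoMem", "KDMN-NoKG"),
    ]
    for model, baseline in pairs:
        print(model, "vs", baseline, gain(model, baseline))
```

Computing the per-type differences also shows that the gains are not uniform: on How questions the full KDMN model is slightly below KDMN-NoMem in Table 1, even though it is ahead on average.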

4.3.2 Open-Domain VQA

In this section, we report the quantitative performance of open-domain VQA in Table 2 along with the sample results in Fig. 4. Since most of the alternative methods do not provide results in the open-domain scenario, we make a comprehensive comparison with our ablative models. As expected, we observe a significant improvement (12.7 points) of our full KDMN model over the KDMN-NoKG model, of which 6.8 points are attributable to the involvement of external knowledge and 5.9 points to the usage of the memory network. Examples in Fig. 4 further provide some intuitive understanding of our algorithm. It is difficult or even impossible for a system to answer an open-domain question when comprehensive reasoning beyond the image content is required, e.g., background knowledge about the prices of objects is essential for a machine to infer which ones are expensive. The larger performance improvement on the open-domain dataset supports our belief that background knowledge is essential for answering general visual questions. Note that the performance can be further improved if ensembling is allowed. We fused the results of several KDMN models trained from different initializations. Experiments demonstrate that this yields a further improvement of about 3.1 points.

Methods Accuracy
KDMN-NoKG 45.1
KDMN-NoMem 51.9
KDMN 57.8
Ensemble 60.9
Table 2: Accuracy on our generated open-domain dataset.
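The text above states only that the results of several KDMN models are fused, without specifying the fusion rule. The following minimal sketch assumes one common choice, averaging the per-candidate softmax probabilities of the models and taking the arg-max; the function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one model's candidate-answer scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_scores):
    """Average the softmax probabilities of all models and return the arg-max index.

    per_model_scores: one list of raw scores per model, all over the same
    ordered set of candidate answers (assumed fusion rule, not from the paper).
    """
    probs = [softmax(scores) for scores in per_model_scores]
    n_candidates = len(probs[0])
    averaged = [
        sum(p[i] for p in probs) / len(probs) for i in range(n_candidates)
    ]
    return max(range(n_candidates), key=averaged.__getitem__)


if __name__ == "__main__":
    # Hypothetical scores of three models over four candidate answers.
    scores = [
        [2.0, 1.0, 0.1, 0.0],
        [0.5, 1.5, 0.2, 0.1],
        [1.8, 0.9, 0.3, 0.2],
    ]
    print("ensemble choice:", ensemble_predict(scores))
```

In this toy example the second model alone would pick a different candidate, but the averaged distribution follows the two models that agree, which is the kind of variance reduction an ensemble over different initializations is meant to provide.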

5 Conclusion

In this paper, we proposed a novel framework named knowledge-incorporated dynamic memory network (KDMN) to answer open-domain visual questions by harnessing massive external knowledge in a dynamic memory network. Context-relevant external knowledge triples are retrieved and embedded into memory slots, then distilled through a dynamic memory network to jointly infer the final answer with visual features. The proposed pipeline not only maintains the superiority of DNN-based methods, but also acquires the ability to exploit external knowledge for answering open-domain visual questions. Extensive experiments demonstrate that our method achieves competitive results on a public large-scale dataset, and gains a large improvement on our generated open-domain dataset.


References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and vqa. arXiv preprint arXiv:1707.07998, 2017.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  • [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. The semantic web, pages 722–735, 2007.
  • [4] S. Bird, E. Klein, and E. Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
  • [5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
  • [6] A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
  • [7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
  • [8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
  • [9] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
  • [10] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
  • [11] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 2017.
  • [14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [16] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
  • [17] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [18] C. Ma, C. Shen, A. Dick, and A. v. d. Hengel. Visual question answering with memory-augmented networks. arXiv preprint arXiv:1707.04968, 2017.
  • [19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE international conference on computer vision, pages 1–9, 2015.
  • [20] B. Pan, H. Li, Z. Zhao, B. Cao, D. Cai, and X. He. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098, 2017.
  • [21] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [22] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. Proc. Advances in Neural Inf. Process. Syst, 1(2):5, 2015.
  • [23] R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686, 2012.
  • [24] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
  • [25] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. Fvqa: fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [26] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
  • [27] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
  • [28] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 2017.
  • [29] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
  • [30] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pages 2397–2406, 2016.
  • [31] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • [32] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [33] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In Conf. on Computer Vision and Pattern Recognition, 2017.
  • [34] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

6 Supplementary Material

6.1 Details of our Open-domain Dataset Generation

KG Triple QA templates
(visual, UsedFor, other) what in this image can be used for {other}?
(other, UsedFor, visual) what in this image can {other} be used for?
(visual, PartOf, other) what in this image is a part of {other}?
(other, PartOf, visual) what in this image has {other} as a part?
(visual, HasProperty, other) what in this image has the property of {other}?
(other, HasProperty, visual) what property does the {other} in this image have?
(visual, HasA, other) what in this image has {other}?
(other, HasA, visual) what in this image belongs to {other}?
(visual, CapableOf, other) what in this image is capable of {other}?
(other, CapableOf, visual) what in this image is {other} capable of?
Table 3: Templates for generating open-domain question-answer pairs. {visual} is the KG entity representing the visual object. {other} is the KG entity that has a connection with {visual}. We take {visual} as the generated ground-truth answer.

We follow several principles when building the open-domain VQA dataset for evaluation: (1) the question-answer pairs should be generated automatically; (2) both visual information and external knowledge should be required to answer the generated open-domain visual questions; (3) the dataset should be in a multiple-choice setting, in accordance with the Visual7W dataset, for fair comparison.

The open-domain question-answer pairs are generated from a subset of images in the standard test split of Visual7W [34], so that the test images are not seen during the training stage. For each image for which we generate open-domain question-answer pairs, we first extract several prominent visual objects and randomly select one of them. Once linked to a semantic entity in ConceptNet [23], the visual object is connected to other ConceptNet entities through various relations, e.g., UsedFor and CapableOf, and forms a number of knowledge triples (head, relation, tail), where either the head or the tail is the visual object. We then randomly select one knowledge triple and fill it into the question-answer template for its relation to obtain the question-answer pair. These templates assume that the correct answer both satisfies the knowledge requirement and appears in the image, as shown in Table 3.
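The template-filling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the templates are copied from Table 3, while the function names, the ("head"/"tail") keying of the templates, and the example triple are hypothetical and do not reproduce our actual generation code. Object extraction and ConceptNet linking are assumed to have been done already.

```python
import random

# Templates from Table 3, keyed by (relation, position of the visual entity).
TEMPLATES = {
    ("UsedFor", "head"): "what in this image can be used for {other}?",
    ("UsedFor", "tail"): "what in this image can {other} be used for?",
    ("PartOf", "head"): "what in this image is a part of {other}?",
    ("PartOf", "tail"): "what in this image has {other} as a part?",
    ("HasProperty", "head"): "what in this image has the property of {other}?",
    ("HasProperty", "tail"): "what property does the {other} in this image have?",
    ("HasA", "head"): "what in this image has {other}?",
    ("HasA", "tail"): "what in this image belongs to {other}?",
    ("CapableOf", "head"): "what in this image is capable of {other}?",
    ("CapableOf", "tail"): "what in this image is {other} capable of?",
}


def fill_template(triple, visual):
    """Turn one (head, relation, tail) triple into a (question, answer) pair."""
    head, relation, tail = triple
    if head == visual:
        key, other = (relation, "head"), tail
    elif tail == visual:
        key, other = (relation, "tail"), head
    else:
        raise ValueError("triple does not involve the visual object")
    # Following Table 3, the visual entity is the ground-truth answer.
    return TEMPLATES[key].format(other=other), visual


def generate_qa(visual_objects, triples, rng=random):
    """Randomly pick a visual object, then one of its triples, and fill a template."""
    visual = rng.choice(visual_objects)
    related = [
        t for t in triples
        if visual in (t[0], t[2]) and (t[1], "head") in TEMPLATES
    ]
    if not related:
        return None  # no usable triple for this object
    return fill_template(rng.choice(related), visual)


if __name__ == "__main__":
    # Hypothetical triple in the spirit of the first example of Figure 5.
    print(generate_qa(["candle"], [("candle", "UsedFor", "light")]))
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when regenerating a fixed evaluation set.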

Figure 5: Examples from our generated open-domain dataset. Ground-truth answers are marked in green. The KG triples at the bottom only provide insight into the generation process and are not included in the dataset. The candidate answers can be quite confusing in some questions, e.g., in the 1st example, the ground-truth “candle” appears in the image and can be used for light, while “cake” also appears in the image but cannot be used for light, and “sun” can be used for light but does not appear in the image.

For each open-domain question-answer pair, we generate three additional confusing items as candidate answers. These candidate answers are randomly sampled from a collection of answers composed of the answers of other question-answer pairs of the same type. To make the open-domain dataset more challenging, we selectively sample confusing answers that either satisfy the knowledge requirement or appear in the image, but do not satisfy both conditions as the ground-truth answers do. Specifically, one of the confusing answers satisfies the knowledge requirement but does not appear in the image, so that the model must attend to the visual objects in the image; another one appears in the image but does not satisfy the knowledge requirement, so that the model must reason over external knowledge to answer these open-domain questions. Please see the examples in Figure 5.
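The selective sampling of confusing answers can be sketched as below. This is an illustrative sketch, not our generation code: the function and argument names are hypothetical, and since the text does not constrain the third confusing answer beyond it not being a second correct answer, the sketch simply draws it at random from the remaining pool.

```python
import random


def sample_distractors(answer, pool, in_image, satisfies_kg, rng=random):
    """Return three confusing candidates for one ground-truth answer.

    pool:         list of unique answers from other QA pairs of the same type
    in_image:     set of answers that appear in the image
    satisfies_kg: set of answers that satisfy the knowledge requirement
    """
    others = [a for a in pool if a != answer]
    # Satisfies the knowledge requirement but is absent from the image.
    kg_only = [a for a in others if a in satisfies_kg and a not in in_image]
    # Appears in the image but does not satisfy the knowledge requirement.
    image_only = [a for a in others if a in in_image and a not in satisfies_kg]
    if not kg_only or not image_only:
        raise ValueError("pool cannot supply both kinds of confusing answers")
    first = rng.choice(kg_only)
    second = rng.choice(image_only)
    # Assumption: the third distractor is any remaining answer that does not
    # satisfy both conditions (otherwise it would be a second correct answer).
    rest = [
        a for a in others
        if a not in (first, second)
        and not (a in in_image and a in satisfies_kg)
    ]
    if not rest:
        raise ValueError("pool too small for a third confusing answer")
    return [first, second, rng.choice(rest)]


if __name__ == "__main__":
    # Hypothetical pool in the spirit of the first example of Figure 5.
    print(sample_distractors(
        answer="candle",
        pool=["candle", "sun", "cake", "dog", "lamp"],
        in_image={"candle", "cake"},
        satisfies_kg={"candle", "sun", "lamp"},
    ))
```

Forcing one distractor of each kind means that a model relying on the image alone or on the knowledge graph alone is left with at least two plausible candidates.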

In total, we generate 16,850 open-domain question-answer pairs based on 8,425 images in the Visual7W test split.