Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

11/01/2018 ∙ by Medhini Narasimhan, et al. ∙ University of Illinois at Urbana-Champaign

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel ‘fact-based’ visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to ‘reason’ about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7% compared to the state of the art.




1 Introduction

When answering questions about images, we easily combine the visualized situation with general knowledge that is available to us. However, for algorithms, an effortless combination of general knowledge with observations remains challenging, despite significant work which aims to leverage these mechanisms for autonomous agents and virtual assistants.

In recent years, a significant amount of research has investigated algorithms for visual question answering (VQA) VQA ; ShihCVPR2016 ; LuARXIV2016 ; ZhangARXIV2015 ; JabriARXIV2016 ; SchwartzNIPS2017 , visual question generation (VQG) JainZhangCVPR2017 ; mostafazadeh2016generating ; li2017visual ; vijayakumar2016diverse , and visual dialog visdial ; das2017learning ; JainCVPR2018 , paving the way to autonomy for artificial agents operating in the real world. Images and questions in these datasets cover a wide range of perceptual abilities such as counting, object recognition, object localization, and even logical reasoning. However, for many of these datasets the questions can be answered solely based on the visualized content, i.e., no general knowledge is required. Therefore, numerous approaches address VQA, VQG and dialog tasks by extracting visual cues using deep network architectures VQA ; ZhuCVPR2016 ; MalinowskiICCV2015 ; RenNIPS2015 ; ShihCVPR2016 ; XiongICML2016 ; XuARXIV2015 ; YangCVPR2016 ; FukuiARXIV2016 ; KimARXIV2016 ; ZhangARXIV2015 ; ZitnickAI2016 ; AndreasCVPR2016 ; DasARXIV2016 ; JabriARXIV2016 ; YuARXIV2015 ; ZhouARXIV2015 ; WuARXIV2016 ; MaARXIV2015 ; LuARXIV2016 ; gordon2018iqa , while general knowledge remains unavailable.

To bridge this discrepancy between human behavior and present day algorithmic design, Wang et al. wang2018fvqa introduced a novel ‘fact-based’ VQA (FVQA) task, and an accompanying dataset containing images, questions with corresponding answers and a knowledge base (KB) of facts extracted from three different sources: WebChild tandon2014webchild , DBPedia Auer2007dbpedia and ConceptNet speer2017conceptnet . Unlike classical VQA datasets, a question in the FVQA dataset is answered by a collective analysis of the information in the image and the KB of facts. Each question is mapped to a single supporting fact which contains the answer to the question. Thus, answering a question requires analyzing the image and choosing the right supporting fact, for which Wang et al. wang2018fvqa propose a keyword-matching technique. This approach suffers when the question doesn’t focus on the most obvious visual concept and when there are synonyms and homographs. Moreover, special information about the visual concept type and the answer source makes it hard to generalize their approach to other datasets. We addressed these issues in our previous work narasimhan2018straight , where we proposed a learning-based approach which embeds the image-question pairs and the facts to the same space and ranks the facts according to their relevance. We observed a significant improvement in performance which motivated us to explore other learning-based methods, particularly those which exploit the graphical structure of the facts.

Figure 1: Results of our graph convolutional net based approach on the recently introduced FVQA dataset.

In this work, our main motivation is to develop a technique which uses the information from multiple facts before arriving at an answer and relies less on retrieving the single ‘correct’ fact needed to answer a question. To this end, we develop a model which ‘thinks out of the box,’ i.e., it ‘reasons’ about the right answer by taking into account a list of facts via a Graph Convolution Network (GCN) kipf2016semi . The GCN enables joint selection of the answer from a list of candidate answers, which sets our approach apart from the previous methods that assess one fact at a time. Moreover, we select a list of supporting facts in the KB by ranking GloVe embeddings. This handles challenges due to synonyms and homographs and also works well with questions that don’t focus on the main object.

We demonstrate the proposed algorithm on the FVQA dataset wang2018fvqa , outperforming the state of the art by around 7%. Figure 1 shows results obtained by our model. Unlike the models proposed in wang2018fvqa , our method does not require any information about the ground truth fact (visual concept type and answer source). In contrast to our approach in narasimhan2018straight , which focuses on learning a joint image-question-fact embedding for retrieving the right fact, our current work uses a simpler method for retrieving multiple candidate facts (while still ensuring that the recall of the ground truth fact is high), followed by a novel GCN inference step that collectively assesses all the relevant facts before arriving at an answer. Using an ablation analysis we find improvements due to the GCN component, which exploits the graphical structure of the knowledge base and allows for sharing of information between possible answers, thus improving the explainability of our model.

2 Related Work

We develop a visual question answering algorithm based on graph convolutional nets which benefits from general knowledge encoded in the form of a knowledge base. We therefore briefly review existing work in the areas of visual question answering, fact-based visual question answering and graph convolutional networks.

Visual Question Answering: Recently, there has been significant progress in creating large VQA datasets MalinowskiNIPS2014 ; RenNIPS2015 ; VQA ; GaoNIPS2015 ; ZhuCVPR2016 ; JohnsonCVPR2017Clevr and deep network models which correctly answer a question about an image. The initial VQA models RenNIPS2015 ; VQA ; GaoNIPS2015 ; LuARXIV2016 ; YangCVPR2016 ; AndreasCVPR2016 ; DasARXIV2016 ; FukuiARXIV2016 ; ShihCVPR2016 ; XuARXIV2015 ; MalinowskiICCV2015 ; MaARXIV2015 ; KimARXIV2016 ; ZitnickAI2016 ; JabriARXIV2016 ; YuARXIV2015 ; ZhouARXIV2015 ; WuARXIV2016 ; BenyounesICCV2017Mutan combined the LSTM encoding of the question and the CNN encoding of the image using a deep network which finally predicted the answer. Results can be improved with attention-based multi-modal networks LuARXIV2016 ; YangCVPR2016 ; AndreasCVPR2016 ; DasARXIV2016 ; FukuiARXIV2016 ; ShihCVPR2016 ; XuARXIV2015 ; SchwartzNIPS2017 and dynamic memory networks XiongICML2016 ; jiang2015compositional . All of these methods were tested on standard VQA datasets where the questions can solely be answered by observing the image. No out of the box thinking was required. For example, given an image of a cat, and the question, “Can the animal in the image be domesticated?,” we want our method to combine features from the image with common sense knowledge (a cat can be domesticated). This calls for the development of a model which leverages external knowledge.

Fact-based Visual Question Answering: Recent research in using external knowledge for natural language comprehension led to the development of semantic parsing ZettlemoyerUAI2005 ; ZettlemoyerACL2005 ; BerantEMNLP2013 ; CaiACL2013 ; LiangCL2013 ; KwiatkowskiEMNLP2013 ; BerantACL2014 ; FaderKDD2014 ; YihACL2015 ; ReddyACL2016 ; XiaoACL2016 ; zhang2016question ; narasimhan2018straight and information retrieval UngerWWW2012 ; KolomiyetsIS2011 ; YaoACL2014 ; BordesEMNLP2014 ; BordesECML2014 ; DongACL2015 ; BordesICLR2015 methods. However, knowledge based visual question answering is fairly new. Notable examples in this direction are works by Zhu et al. ZhuARXIV2015 , Wu et al. WuCVPR2016 , Wang et al. Wang2017Ahab , Narasimhan et al. NarasimhanEMNLP2016 , Krishnamurthy and Kollar KrishnamurthyACL2013 , and our previous work, Narasimhan and Schwing narasimhan2018straight .

Ask Me Anything (AMA) by Wu et al. WuCVPR2016 , AHAB by Wang et al. Wang2017Ahab , and FVQA by Wang et al. wang2018fvqa are closely related to our work. In AMA, attribute information extracted from the image is used to query the external knowledge base DBpedia Auer2007dbpedia , to retrieve paragraphs which are summarized to form a knowledge vector. The knowledge vector is combined with the attribute vector and multiple captions generated for the image, before being passed as input to an LSTM which predicts the answer. The main drawback of AMA is that it does not perform any explicit reasoning and ignores the possible structure in the KB. To address this, AHAB and FVQA attempt to perform explicit reasoning. In AHAB, the question is converted to a database query via a multistep process, and the response to the query is processed to obtain the final answer. FVQA also learns a mapping from questions to database queries through classifying questions into categories and extracting parts from the question deemed to be important. A matching score is computed between the facts retrieved from the database and the question, to determine the most relevant fact which forms the basis of the answer for the question. Both these methods use databases with a particular structure: facts are represented as tuples, for example, (Apple, IsA, Fruit), and (Cheetah, FasterThan, Lion).

The present work follows up on our earlier method, Straight to the Facts (STTF) narasimhan2018straight . STTF uses object, scene, and action predictors to represent an image and an LSTM to represent a question and combines the two using a deep network. The facts are scored based on the cosine similarity of the image-question embedding and fact embedding. The answer is extracted from the highest scoring fact.

We evaluate our method on the dataset released as part of the FVQA work, referred to as the FVQA dataset wang2018fvqa , which is a subset of three structured databases – DBpedia Auer2007dbpedia , ConceptNet speer2017conceptnet , and WebChild tandon2014webchild .

Graph Convolutional Nets: Kipf and Welling kipf2016semi introduced Graph Convolutional Networks (GCN) to extend convolutional nets (CNNs) lecun1998gradient to arbitrarily connected undirected graphs. GCNs learn a representation for every node in the graph that encodes both the local structure of the graph surrounding the node of interest, as well as the features of the node itself. At a graph convolutional layer, features are aggregated from neighboring nodes and the node itself to produce new output features. By stacking multiple layers, we are able to gather information from nodes further away. GCNs have been applied successfully for graph node classification kipf2016semi , graph link prediction schlichtkrull2018modeling , and zero-shot prediction wang2018zero . Knowledge graphs naturally lend themselves to applications of GCNs owing to the underlying structured interactions between nodes connected by relationships of various types. In this work, given an image and a question about the image, we first identify useful sub-graphs of a large knowledge graph such as DBpedia Auer2007dbpedia and then use GCNs to produce representations encoding node and neighborhood features that can be used for answering the question.

Specifically, we propose a model that retrieves the facts most relevant to a question-image pair based on GloVe features. The sub-graph of facts is passed through a graph convolution network which predicts an answer from these facts. Our approach has the following advantages: 1) Unlike FVQA and AHAB, we avoid the step of query construction and do not use the ground truth visual concept or answer type information, which makes it possible to incorporate any fact space into our model. 2) We use GloVe embeddings for retrieving and representing facts, which works well with synonyms and homographs. 3) In contrast to STTF, which uses a deep network to arrive at the right fact, we use a GCN which operates on a subgraph of relevant facts while retaining the graphical structure of the knowledge base, which allows for reasoning using message passing. 4) Unlike previous works, we have reduced the reliance on the knowledge of the ground truth fact at training time.

3 Visual Question Answering with Knowledge Bases

To jointly ‘reason’ about a set of answers for a given question-image pair, we develop a graph convolution net (GCN) based approach for visual question answering with knowledge bases. In the following we first provide an overview of the proposed approach before delving into details of the individual components.

Figure 2: Outline of the proposed approach: Given an image and a question, we use a similarity scoring technique (1) to obtain relevant facts from the fact space. An LSTM (2) predicts the relation from the question to further reduce the set of relevant facts and its entities. An entity embedding is obtained by concatenating the visual concepts embedding of the image (3), the LSTM embedding of the question (4), and the LSTM embedding of the entity (5). Each entity forms a single node in the graph and the relations constitute the edges (6). A GCN followed by an MLP performs joint assessment (7) to predict the answer. Our approach is trained end-to-end.

Overview: Our proposed approach is outlined in Figure 2. Given an image $I$ and a corresponding question $Q$, the task is to predict an answer $A$ while using an external knowledge base KB which consists of facts $f_i$, i.e., $\text{KB} = \{f_1, f_2, \ldots, f_{|\text{KB}|}\}$. A fact is represented as a Resource Description Framework (RDF) triplet of the form $f = (x, r, y)$, where $x$ is a visual concept grounded in the image, $y$ is an attribute or phrase, and $r \in \mathcal{R}$ is a relation between the two entities, $x$ and $y$. The relations in the knowledge base are part of a set of 13 possible relations $\mathcal{R} = \{$Category, Comparative, HasA, IsA, HasProperty, CapableOf, Desires, RelatedTo, AtLocation, PartOf, ReceivesAction, UsedFor, CreatedBy$\}$. Subsequently we use $x(f)$, $y(f)$, or $r(f)$ to extract the visual concept $x$, the attribute phrase $y$, or the relation $r$ in fact $f = (x, r, y)$ respectively.

Every question $Q$ is associated with a single fact, $f^* = (x^*, r^*, y^*)$, that helps answer the question. More specifically, the answer $A$ is one of the two entities of that fact, i.e., either $A = x^*$ or $A = y^*$, both of which can be extracted from $f^*$.

Wang et al. wang2018fvqa formulate the task as prediction of a fact $\hat{f}$ for a given question-image pair, and subsequently extract either its visual concept $x$ or its attribute phrase $y$, depending on the result of an answer source classifier. As there are over 190,000 facts, retrieving the correct supporting fact is challenging and computationally inefficient. Usage of question properties like ‘visual concept type’ makes their approach hard to extend.

Guided by the observation that the correct supporting fact is within the top-100 of a retrieval model 84.8% of the time, we develop a two step solution: (1) retrieving the most relevant facts for a given question-image pair. To do this, we extract the top-100 facts, $f_{100}$, based on word similarity between the question and the fact. Further, we obtain the set of relevant facts $f_{\text{rel}}$ by reducing $f_{100}$ based on consistency of the fact relation $r$ with a predicted relation $\hat{r}$. (2) predicting the answer as one of the entities in this reduced fact space $f_{\text{rel}}$. To predict the answer we use a GCN to compute representations of nodes in a graph, where the nodes correspond to the unique entities $e \in E$, i.e., either $x$ or $y$ of a fact in the fact space $f_{\text{rel}}$. Two entities in the graph are connected if a fact relates the two. Using a GCN permits joint assessment of the suitability of all entities, which makes our proposed approach different from classification based techniques.

For example, consider the image and the question shown in Figure 2. The relation for this question is “IsA” and the fact associated with this question-image pair is (Orange, IsA, Citric). The answer is Orange. In the following we first discuss retrieval of the most relevant facts for a given question-image pair before detailing our GCN approach for extracting the answer from this reduced fact space.

3.1 Retrieval of Relevant Facts

To retrieve a set of relevant facts for a given question-image pair, we pursue a score based approach. We first compute the cosine similarity of the GloVe embeddings of the words in the fact with the words in the question and the words of the visual concepts detected in the image. Because some words may differ between question and fact, we obtain a fact score by averaging the Top-K word similarity scores. We rank the facts based on their similarity and retrieve the top-100 facts for each question, which we denote $f_{100}$. We chose 100 facts as this gives the best downstream accuracy as shown in Table 1. As indicated in Table 1, we observe a high recall of the ground truth fact in the retrieved facts while using this technique. This motivates us to avoid a complex model which finds the right fact, as used in wang2018fvqa and narasimhan2018straight , and instead use the retrieved facts to directly predict the answer.
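The scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 3-dimensional vectors stand in for GloVe embeddings, the names `cosine`, `fact_score`, `retrieve`, and `top_fraction` are hypothetical, only the two entities of a fact are scored, and each fact word is scored by its maximum similarity to any question or visual-concept word (the text does not specify this aggregation).

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fact_score(fact_words, context_words, emb, top_fraction=0.5):
    """Average the best-matching fraction of fact words against the context.

    context_words = question words + detected visual concepts.
    Assumption: each fact word is scored by its maximum similarity
    to any context word.
    """
    sims = sorted(
        (max(cosine(emb[f], emb[c]) for c in context_words) for f in fact_words),
        reverse=True,
    )
    k = max(1, round(top_fraction * len(sims)))
    return sum(sims[:k]) / k

def retrieve(facts, context_words, emb, n=100):
    """Return the n highest-scoring facts; a fact is an (x, r, y) tuple of single words."""
    def score(fact):
        return fact_score([fact[0], fact[2]], context_words, emb)
    return sorted(facts, key=score, reverse=True)[:n]

# Toy embeddings (hypothetical values, not real GloVe vectors).
emb = {
    "orange": [1.0, 0.1, 0.0], "fruit": [0.9, 0.2, 0.1], "citric": [0.8, 0.3, 0.0],
    "car": [0.0, 1.0, 0.2], "vehicle": [0.1, 0.9, 0.3],
}
facts = [("car", "IsA", "vehicle"), ("orange", "IsA", "citric")]
ranked = retrieve(facts, ["fruit", "orange"], emb, n=1)
print(ranked)
```

Averaging only the best-matching fraction of words keeps a fact from being penalized for words that have no counterpart in the question.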

We further reduce this set of 100 facts by assessing their relation attribute. To predict the relation from a given question, we use the approach described in narasimhan2018straight . We retain the facts among the top-100 only if their relation agrees with the predicted relation $\hat{r}$, i.e., $f_{\text{rel}} = \{f \in f_{100} : r(f) = \hat{r}\}$.

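For every question, unique entities in the facts $f_{\text{rel}}$ are grouped into a set of candidate entities, $E = \{x(f) : f \in f_{\text{rel}}\} \cup \{y(f) : f \in f_{\text{rel}}\}$, with $|E| \le 200$ (2 entities/fact and at most 100 facts).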
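The relation filter and the construction of the candidate entity set can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy facts are hypothetical, and facts are assumed to be stored as `(x, r, y)` tuples.

```python
def filter_by_relation(facts, predicted_relations):
    """Keep the facts whose relation is among the predicted ones (top-1 or top-3)."""
    allowed = set(predicted_relations)
    return [fact for fact in facts if fact[1] in allowed]

def candidate_entities(facts):
    """Collect the unique entities (x and y of every fact) in first-seen order."""
    seen = {}
    for x, _, y in facts:
        seen.setdefault(x, None)
        seen.setdefault(y, None)
    return list(seen)

# Toy stand-in for the retrieved top-100 facts.
top_facts = [
    ("orange", "IsA", "citric"),
    ("orange", "HasProperty", "round"),
    ("lemon", "IsA", "citric"),
]
f_rel = filter_by_relation(top_facts, ["IsA"])
entities = candidate_entities(f_rel)
print(f_rel)
print(entities)
```

Because every fact contributes at most two entities, the candidate set never exceeds twice the number of retained facts.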

Currently, we train the relation predictor’s parameters independently of the remaining model. In future work we aim for an end-to-end model which includes this step.

3.2 Answer Prediction

Given the set of candidate entities $E$, we want to ‘reason’ about the answer, i.e., we want to predict an entity $\hat{A} \in E$. To jointly assess the suitability of all candidate entities in $E$, we develop a Graph-Convolution Net (GCN) based approach which is augmented by a multi-layer perceptron (MLP). The nodes in the employed graph correspond to the available entities $e \in E$ and their node representation is given as an input to the GCN. The GCN combines entity representations in multiple iterative steps. The final transformed entity representations learned by the GCN are then used as input in an MLP which predicts a binary label, i.e., $\{1, 0\}$, for each entity $e \in E$, indicating if $e$ is or isn’t the answer.

More formally, the goal of the GCN is to learn how to combine representations for the nodes $e \in E$ of a graph, $G = (E, \mathcal{E})$, where $\mathcal{E}$ denotes the set of edges. Its output feature representations depend on: (1) learnable weights; (2) an adjacency matrix $A$ describing the graph structure of $G$. We consider two entities to be connected if they belong to the same fact; (3) a parametric input representation $g_w(e)$ for every node $e \in E$ of the graph. We subsume the original feature representations of all nodes in an $|E| \times D$-dimensional feature matrix $H^{(0)} \in \mathbb{R}^{|E| \times D}$, where $D$ is the number of features. In our case, each node $e \in E$ is represented by the concatenation of the corresponding image, question and entity representation, i.e., $g_w(e) = (g_w^V(I), g_w^Q(Q), g_w^C(e))$. Combining the three representations ensures that each node/entity depends on the image and the question. The node representation is discussed in detail below.

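The GCN consists of $L$ hidden layers where each layer is a non-linear function $f(\cdot, \cdot)$. Specifically,

$$H^{(l)} = f(H^{(l-1)}, A) = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l-1)}W^{(l-1)}\right) \quad \forall l \in \{1, \ldots, L\},$$

where the input to the GCN is $H^{(0)}$, $\tilde{A} = A + I$ ($I$ is an identity matrix), $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, $W^{(l)}$ is the matrix of trainable weights at the $l$-th layer of the GCN, and $\sigma(\cdot)$ is a non-linear activation function. We let the $K$-dimensional vector $\hat{g}(e) \in \mathbb{R}^K$ refer to the output of the GCN, extracted from $H^{(L)} \in \mathbb{R}^{|E| \times K}$. Hereby, $K$ is the number of output features.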
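The propagation rule of a graph convolutional layer can be illustrated with NumPy. The sketch below is not the authors' code: it assumes the standard symmetric normalization of Kipf and Welling with self-loops, a ReLU activation, and random stand-ins for the node features and weights; the helper names are hypothetical.

```python
import numpy as np

def adjacency_from_facts(entities, facts):
    """Symmetric 0/1 adjacency: two entities are connected if some fact relates them."""
    index = {e: i for i, e in enumerate(entities)}
    A = np.zeros((len(entities), len(entities)))
    for x, _, y in facts:
        A[index[x], index[y]] = 1.0
        A[index[y], index[x]] = 1.0
    return A

def normalize_adjacency(A):
    """Compute D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, W):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

entities = ["orange", "citric", "lemon"]
facts = [("orange", "IsA", "citric"), ("lemon", "IsA", "citric")]
A_hat = normalize_adjacency(adjacency_from_facts(entities, facts))

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))   # |E| x D input features (random stand-ins)
W0 = rng.normal(size=(4, 5))   # weights of the first layer
W1 = rng.normal(size=(5, 2))   # weights of the second layer
H2 = gcn_layer(gcn_layer(H0, A_hat, W0), A_hat, W1)
print(H2.shape)  # (3, 2)
```

After two layers, the row for "orange" also depends on the features of "lemon", although the two are linked only through "citric"; this is the information sharing between candidate answers that stacking layers provides.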

The output of the GCN, $\hat{g}(e)$, is passed through an MLP to obtain the probability $p_w^{NN}(\hat{g}(e))$ that $e \in E$ is the answer for the given question-image pair. We obtain our predicted answer via

$$\hat{A} = \arg\max_{e \in E} \; p_w^{NN}(\hat{g}(e)).$$

As mentioned before, each node $e \in E$ is represented by the concatenation of the corresponding image, question and entity representation, i.e., $g_w(e) = (g_w^V(I), g_w^Q(Q), g_w^C(e))$. We discuss those three representations subsequently.
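The final selection step amounts to scoring every entity with the MLP and taking the arg max. A minimal sketch, assuming a single hidden layer with ReLU and a sigmoid output; the function names, the hand-set weights, and the 2-dimensional GCN outputs are hypothetical and chosen only to make the example easy to follow.

```python
import numpy as np

def mlp_answer_probabilities(G, W1, b1, w2, b2):
    """Per-entity probability of being the answer: sigmoid(ReLU(G W1 + b1) w2 + b2)."""
    hidden = np.maximum(G @ W1 + b1, 0.0)
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

def predict_answer(entities, probs):
    """Return the entity with the highest answer probability (the arg max over E)."""
    return entities[int(np.argmax(probs))]

entities = ["orange", "citric", "lemon"]
G = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # stand-in GCN outputs, one row per entity
W1, b1 = np.eye(2), np.zeros(2)                     # hand-set weights for illustration
w2, b2 = np.array([2.0, -2.0]), 0.0

probs = mlp_answer_probabilities(G, W1, b1, w2, b2)
answer = predict_answer(entities, probs)
print(answer)  # orange
```

Each entity receives an independent binary score, so the arg max is what turns the per-node classification into a single joint decision.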

Number of retrieved facts | @1 | @50 | @100 | @150 | @200 | @500
Fact Recall | 22.6 | 76.5 | 84.8 | 88.4 | 91.6 | 93.1
Downstream Accuracy | 22.6 | 58.93 | 69.35 | 68.23 | 65.61 | 60.22

Table 1: Recall and downstream accuracy for different numbers of retrieved facts.

1. Image Representation: The image representation, $g_w^V(I)$, is a multi-hot vector of size 1176, indicating the visual concepts which are grounded in the image. Three types of visual concepts are detected in the image: actions, scenes and objects. These are detected using the same pre-trained networks described in narasimhan2018straight .

2. Question Representation: An LSTM net is used to encode each question into the representation $g_w^Q(Q)$. The LSTM is initialized with GloVe embeddings pennington2014GloVe for each word in the question, which are fine-tuned during training. The hidden representation of the LSTM constitutes the question encoding.

3. Entity Representation: For each question, the entity encoding $g_w^C(e)$ is computed for every entity $e$ in the entity set $E$. Note that an entity is generally composed of multiple words. Therefore, similar to the question encoding, the hidden representation of an LSTM net is used. It is also initialized with the GloVe embeddings pennington2014GloVe of each word in the entity, which are fine-tuned during training.

The answer prediction model parameters consist of weights from the question embedding, entity embedding, GCN, and MLP. These are trained end-to-end.
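How the three representations combine into one node feature can be sketched as below. The sketch makes several assumptions: the vocabulary is a list of 1176 placeholder names, the LSTM encodings are replaced by zero vectors of size 128 (the size reported in the experimental section), and the helper names are hypothetical.

```python
def multi_hot(detected, vocabulary):
    """Binary vector with a 1 at the index of every detected visual concept."""
    index = {concept: i for i, concept in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for concept in detected:
        if concept in index:
            vec[index[concept]] = 1.0
    return vec

def node_feature(image_vec, question_vec, entity_vec):
    """Concatenate image, question, and entity representations for one node."""
    return list(image_vec) + list(question_vec) + list(entity_vec)

# Hypothetical vocabulary: 1176 placeholder concept names.
vocabulary = [f"concept_{i}" for i in range(1176)]
image_vec = multi_hot(["concept_3", "concept_42"], vocabulary)
question_vec = [0.0] * 128   # stand-in for the question LSTM hidden state
entity_vec = [0.0] * 128     # stand-in for the entity LSTM hidden state
feature = node_feature(image_vec, question_vec, entity_vec)
print(len(feature), sum(image_vec))
```

The image and question parts are identical for all nodes of one question; only the entity part differs from node to node.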

3.3 Learning

We note that the answer prediction and relation prediction model parameters are trained separately. The dataset, $\mathcal{D}$, to train both these parameters is obtained from wang2018fvqa . It contains tuples $(I, Q, f^*, A^*)$, each composed of an image $I$, a question $Q$, as well as the ground-truth fact $f^*$ and answer $A^*$.

To train the relation predictor’s parameters we use the subset $\mathcal{D}_{\text{rel}} = \{(Q, r^*)\}$, containing pairs of questions and the corresponding relations $r^*$ extracted from the ground-truth fact $f^*$. Stochastic gradient descent and classical cross-entropy loss are used to train the classifier.

The answer predictor’s parameters consist of the question and entity embeddings, the two hidden layers of the GCN, and the layers of the MLP. The model operates on question-image pairs and extracts the label of each entity from the ground-truth answer $A^*$ of the dataset $\mathcal{D}$, i.e., 0 if the entity isn’t the answer and 1 if it is. Again we use stochastic gradient descent and binary cross-entropy loss.
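The per-entity labels and the binary cross-entropy objective can be written out as follows. This is a plain-Python sketch for illustration; averaging the loss over the nodes of a question and clipping probabilities with `eps` are assumptions, and the function names are hypothetical.

```python
import math

def entity_labels(entities, answer):
    """Label 1 for the ground-truth answer entity and 0 for every other entity."""
    return [1 if entity == answer else 0 for entity in entities]

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over all nodes (probabilities clipped for stability)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

entities = ["orange", "citric", "lemon"]
labels = entity_labels(entities, "orange")
loss = binary_cross_entropy([0.9, 0.2, 0.1], labels)
print(labels, round(loss, 4))
```

Since at most one entity per question carries the label 1, the positive class is rare; the text does not say whether any re-weighting is applied, so none is used here.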

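Method | Accuracy @1 | Accuracy @3
LSTM-Question+Image+Pre-VQA wang2018fvqa | 24.98 | 40.40
Hie-Question+Image+Pre-VQA wang2018fvqa | 43.41 | 59.44
FVQA wang2018fvqa | 56.91 | 64.65
Ensemble wang2018fvqa | 58.76 | -
Straight to the Facts (STTF) narasimhan2018straight | 62.20 | 75.60

Ours | Q | VC | Entity | MLP | GCN Layers | Rel | Accuracy @1 | Accuracy @3
1 | ✓ | - | ✓ | ✓ | - | - | 10.32 | 13.15
2 | ✓ | - | ✓ | ✓ | - | @1 | 13.89 | 16.40
3 | ✓ | - | ✓ | ✓ | 2 | @1 | 14.12 | 17.75
4 | ✓ | ✓ | ✓ | ✓ | - | - | 29.72 | 35.38
5 | ✓ | ✓ | ✓ | ✓ | - | @1 | 50.36 | 56.21
6 | ✓ | ✓ | ✓ | - | 2 | @1 | 48.43 | 53.87
7 | ✓ | ✓ | ✓ | ✓ | 1 | @1 | 54.60 | 60.91
8 | ✓ | ✓ | ✓ | ✓ | 1 | @3 | 57.89 | 65.14
9 | ✓ | ✓ | ✓ | ✓ | 3 | @1 | 56.90 | 62.32
10 | ✓ | ✓ | ✓ | ✓ | 3 | @3 | 60.78 | 68.65
11 | ✓ | ✓ | ✓ | ✓ | 2 | @1 | 65.80 | 77.32
12 | ✓ | ✓ | ✓ | ✓ | 2 | @3 | 69.35 | 80.25
13 | ✓ | ✓ | ✓ | ✓ | 2 | gt | 72.97 | 83.01

Human | 77.99 | -

Table 2: Answer accuracy over the FVQA dataset.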

4 Experimental Evaluation

Before assessing the proposed approach, we first review properties of the FVQA dataset. We then present quantitative results to compare our proposed approach with existing baselines before illustrating qualitative results.

Factual visual question answering dataset: To evaluate our model, we use the publicly available FVQA wang2018fvqa knowledge base and dataset. This dataset consists of 2,190 images, 5,286 questions, and 4,126 unique facts corresponding to the questions. The knowledge base consists of 193,449 facts, which were constructed by extracting top visual concepts for all images in the dataset and querying for those concepts in the knowledge bases, WebChild tandon2014webchild , ConceptNet speer2017conceptnet , and DBPedia Auer2007dbpedia .

Retrieval of Relevant Facts: As described in Section 3.1, a similarity scoring technique is used to retrieve the top-100 facts for every question. GloVe 100d embeddings are used to represent each word in the fact and question. Stop words (such as “what,” “where,” “the”) are first removed from the question. To assign a similarity score to each fact, we compute the word-wise cosine similarity of the GloVe embedding of every word in the fact with the words in the question and the detected visual concepts. We choose the top K% of the words in the fact with the highest similarity and average these values to assign a similarity score to the fact. The value of K was chosen empirically. The facts are sorted based on the similarity and the 100 highest scoring facts are retained. Table 1 shows that the ground truth fact is present in the top-100 retrieved facts 84.8% of the time and is retrieved as the top-1 fact 22.5% of the time. The numbers reported are an average over the five test sets. We also varied the number of facts retrieved in the first stage and report the recall and downstream accuracy in Table 1. The recall @50 (76.5%) is lower than the recall @100 (84.8%), which causes the final accuracy of the model to drop to 58.93%. When we retrieve 150 facts, recall is 88.4% and final accuracy is 68.23%, which is slightly below the final accuracy when retrieving 100 facts (69.35%). The final accuracy further drops as we increase the number of retrieved facts to 200 and 500.

Predicting the relation: As described earlier, we use the network proposed in narasimhan2018straight to determine the relation given a question. Using this approach, the Top-1 and Top-3 accuracy for relation prediction are 75.4% and 91.97% respectively.
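For reference, Top-1 and Top-3 accuracy can be computed from ranked predictions as sketched below. The function name and the toy predictions are hypothetical and serve only to show the metric; they do not reproduce the reported numbers.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Percentage of examples whose target is among the first k ranked predictions."""
    hits = sum(target in preds[:k] for preds, target in zip(ranked_predictions, targets))
    return 100.0 * hits / len(targets)

# Toy ranked relation predictions for four questions (illustration only).
ranked = [
    ["IsA", "HasA", "UsedFor"],
    ["CapableOf", "IsA", "HasA"],
    ["UsedFor", "PartOf", "AtLocation"],
    ["HasProperty", "IsA", "Category"],
]
gold = ["IsA", "HasA", "AtLocation", "Desires"]
top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
print(top1, top3)  # 25.0 75.0
```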

Sub-component Error % @1
Fact-retrieval 15.20
Relation prediction 9.4
Answer prediction(GCN) 6.05
Total error 30.65
Table 3: Error contribution of the sub-components of the model to the total Top-1 error (30.65%).
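As a quick arithmetic check, the three contributions in Table 3 sum to the stated total, and the complement of that total equals the 69.35% accuracy reported for 100 retrieved facts in Table 1. Relating the fact-retrieval error to the recall at 100 facts in Table 1 is our reading of the numbers rather than a statement made in the text.

```latex
\begin{align*}
15.20 + 9.40 + 6.05 &= 30.65 \\
100 - 30.65 &= 69.35 \\
100 - 84.8 &= 15.2
\end{align*}
```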

Predicting the Correct Answer: Section 3.2 explains in detail the model used to predict an answer from the set of candidate entities $E$. Each node of the graph is represented by the concatenation of the image, question, and entity embeddings. The image embedding $g_w^V(I)$ is a multi-hot vector of size 1176, indicating the presence of a visual concept in the image. The LSTM to compute the question embedding $g_w^Q(Q)$ is initialized with GloVe 100d embeddings for each of the words in the question. Batch normalization and a dropout of 0.5 are applied after both the embedding layer and the LSTM layer. The question embedding is given by the hidden layer of the LSTM and is of size 128. Each entity $e \in E$ is also represented by a 128 dimensional vector $g_w^C(e)$ which is computed by an LSTM operating on the words of the entity $e$. The concatenated vector $g_w(e)$ has a dimension of 1432 (i.e., 1176+128+128).

For each question, the feature matrix $H^{(0)}$ is constructed from the node representations $g_w(e)$. The adjacency matrix $A$ denotes the edges between the nodes. It is constructed by using the Top-1 or Top-3 relations predicted in Section 3.1. The adjacency matrix is of size $200 \times 200$ as the set $E$ has at most 200 unique entities (i.e., 2 entities per fact and 100 facts per question). The GCN consists of 2 hidden layers, each operating on 200 nodes, and each node is represented by a feature vector of size 512. The representations of each node from the second hidden layer, i.e., $H^{(2)}$, are used as input for a multi-layer perceptron which has 512 input nodes and 128 hidden nodes. The output of the hidden nodes is passed to a binary classifier that predicts 0 if the entity is not the answer and 1 if it is. The model is trained end-to-end over 100 epochs with batch gradient descent (Adam optimizer) using cross-entropy loss for each node. Batch normalization and a dropout of 0.5 were applied after each layer. The activation function used throughout is ReLU.
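The tensor shapes implied by this configuration can be traced with a short NumPy sketch. It is a shape walk-through only, under several assumptions: weights and inputs are random stand-ins, batch normalization and dropout are omitted, the input width is computed as the sum of the three embedding sizes, and unused node slots are treated as isolated padding nodes (the text does not say how graphs with fewer than 200 entities are handled).

```python
import numpy as np

N = 200                      # maximum number of entities per question
D_IN = 1176 + 128 + 128      # image + question + entity embedding sizes
D_HID = 512                  # width of both GCN hidden layers
D_MLP = 128                  # hidden width of the MLP

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(N, D_IN))          # node features (random stand-ins)

A = np.zeros((N, N))
A[0, 1] = A[1, 0] = 1.0                  # a single edge; all other nodes are isolated
A_tilde = A + np.eye(N)
d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

W0 = rng.normal(scale=0.01, size=(D_IN, D_HID))
W1 = rng.normal(scale=0.01, size=(D_HID, D_HID))
H1 = relu(A_hat @ H0 @ W0)               # first GCN layer
H2 = relu(A_hat @ H1 @ W1)               # second GCN layer

V1 = rng.normal(scale=0.01, size=(D_HID, D_MLP))
v2 = rng.normal(scale=0.01, size=D_MLP)
probs = 1.0 / (1.0 + np.exp(-(relu(H2 @ V1) @ v2)))   # one probability per node

print(H0.shape, H1.shape, H2.shape, probs.shape)
```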

To demonstrate the effectiveness of our model, we show six ablation studies in Table 2. Q, VC, Entity denote question, visual concept, and entity embeddings respectively. ‘11’ is the model discussed in Section 3 where the entities are first filtered by the predicted relation and each node of the graph is represented by a concatenation of the question, visual concept, and entity embeddings. ‘12’ uses the top three relations predicted by the question-relation LSTM net and retains all the entities which are connected by these three relations. ‘13’ uses the ground truth relation for every question.

To validate the approach we construct some additional baselines. In ‘1,’ each node is represented using only the question and the entity embeddings and the entities are not filtered by relation. Instead, all the entities in $E$ are fed to the MLP. ‘2’ additionally filters based on relation. ‘3’ introduces a 2-layer GCN before the MLP. ‘4’ is the same as ‘1’ except each node is now represented using question, entity and visual concept embeddings. ‘5’ filters by relation and skips the GCN by feeding the entity representations directly to the MLP. ‘6’ skips the MLP and the output nodes of the GCN are directly classified using a binary classifier. We observe that there is a significant improvement in performance when we include the visual concept features in addition to question and entity embeddings, thus highlighting the importance of the visual concepts. Without visual concepts, the facts retrieved in the first step have low recall which in turn reduces the downstream test accuracy.

We also report the top-1 and top-3 accuracy obtained by varying the number of layers in the GCN. With 3 layers (‘9’ and ‘10’), our model overfits, causing the top-1 test accuracy to drop to 60.78% (‘10’). With 1 layer (‘7’ and ‘8’), the top-1 accuracy is 57.89% (‘8’) and we hypothesize that this is due to the limited information exchange that occurs with one GCN layer. We observe a correlation between the sparsity of the adjacency matrix and the performance of the 1-layer GCN model. When the number of facts retrieved is large and the matrix is less sparse, the 1-layer GCN model tends to make a wrong prediction. This indicates that the 2nd layer of the GCN allows for more message passing and provides a stronger signal when there are many facts to analyze.

Figure 3: Visual Concepts (VCs) detected by our model. For each image we detect objects, scenes, and actions. We observe the supporting facts to have strong alignment with the VCs which proves the effectiveness of including VCs in our model.

We compare the accuracy of our model with the FVQA baselines and our previous work, STTF, in Table 2. The accuracy reported here is averaged over all five train-test splits. As shown, our best model that does not use the ground truth relation, ‘12,’ outperforms the state-of-the-art STTF technique by more than 7% and the FVQA baseline without ensemble by over 12%. Note that combining GCN and MLP clearly outperforms usage of only one part. FVQA and STTF both try to predict the ground truth fact. If the fact is predicted incorrectly, the answer will also be wrong, thus causing the model to fail. Our method circumvents predicting the fact and instead uses multiple relevant facts to predict the answer. This approach clearly works better.

Synonyms and homographs: Here we show the improvements of our model compared to the baseline with respect to synonyms and homographs. To retrieve the top 100 facts, we use trainable word embeddings which are known to group synonyms and separate homographs.

We ran additional tests using WordNet to determine the number of question-fact pairs which contain synonyms. The test data contains 1105 such pairs, out of which our model predicts 95.38% correctly, whereas the FVQA and STTF models predict 78% and 91.6% correctly, respectively. In addition, we manually generated 100 synonymous questions by replacing words in the questions with synonyms (e.g., “What in the bowl can you eat?” is rephrased as “What in the bowl is edible?”). Tests on these 100 new samples show that our model predicts 91 of them correctly, whereas the keyword-matching FVQA technique gets only 61 right. As STTF also uses GloVe embeddings, it gets 89 correct. With regard to homographs, the test set has 998 questions which contain words that have multiple meanings across facts. Our model predicts correct answers for 81.16% of them, whereas the FVQA model and STTF model get 66.33% and 79.4% correct, respectively.

Qualitative results: As described in sec:answerpred, the image embedding is constructed based on the visual concepts detected in the image. Figure 3 shows the object, scene, and action detections for two examples in our dataset. We also indicate the question corresponding to the image, the supporting fact, the relation, and the answer detected by our model. Using these high-level features helps summarize the salient content in the image, as the facts are closely related to the visual concepts. We observe that our model works well even when the question does not focus on the main visual concept in the image. table:answer_results shows that including the visual concepts improves the accuracy of our model by nearly 20%.

Figure 4 depicts a few success and failure examples of our method. In our model, predicting the correct answer involves three main steps: (1) selecting the right supporting fact among the top-100 facts; (2) predicting the right relation; and (3) selecting the right entity in the GCN. In the top two rows of examples, our model correctly executes all three steps. As shown, our model works for visual concepts of all three types, i.e., actions, scenes, and objects. The examples in the second row indicate that our model works well with synonyms and homographs, as we use GloVe embeddings of words. The second example in the second row shows that our method obtains the right answer even when the question and the fact do not have many words in common. This is due to the comparison with visual concepts while retrieving the facts.
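The interplay of the three steps can be sketched as follows. The function assumes that the retrieved facts, the predicted relation, and per-entity scores are already given by upstream components; the fact list and the scores are hypothetical placeholders standing in for retrieval and GCN outputs.

```python
def answer_from_facts(retrieved_facts, predicted_relation, entity_scores):
    """Filter facts by relation, gather their entities, return the best-scoring one."""
    # Step 2: keep only the facts that match the predicted relation.
    kept = [fact for fact in retrieved_facts if fact[1] == predicted_relation]
    # Collect the candidate entities (graph nodes) of the remaining facts.
    candidates = []
    for subject, _, obj in kept:
        for entity in (subject, obj):
            if entity not in candidates:
                candidates.append(entity)
    if not candidates:
        return None  # a retrieval or relation error leaves no candidate
    # Step 3: pick the entity with the highest score (placeholder for the GCN output).
    return max(candidates, key=lambda entity: entity_scores.get(entity, 0.0))


# Step 1 is assumed to have produced these facts (hypothetical toy retrieval result).
toy_retrieved = [
    ("airplane", "UsedFor", "flying"),
    ("laptop", "UsedFor", "data processing"),
    ("cat", "IsA", "animal"),
]
# Hypothetical per-entity scores; in the model these come from the network.
toy_scores = {"airplane": 0.2, "flying": 0.9, "laptop": 0.4,
              "data processing": 0.3, "cat": 0.95, "animal": 0.1}

answer = answer_from_facts(toy_retrieved, "UsedFor", toy_scores)
```

Note that "cat" has the highest score overall but is never considered, because its fact is removed by the relation filter; an error in any of the three steps therefore propagates to the final answer.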

Figure 4: Success and failure cases: Success cases are shown in the top two rows. Our method correctly predicts the relation, visual concept, and the answer. The bottom row shows three different failure cases.

The last row shows failure cases. Our method fails if any of the three steps produces incorrect output. In the first example, the ground-truth fact (Airplane, UsedFor, Flying) is not part of the top-100. This happens when words in the fact are related neither to the words in the question nor to the list of visual concepts. A second failure mode is due to wrong node/entity predictions (selecting laptop instead of keyboard), e.g., because a similar fact, (Laptop, UsedFor, Data processing), exists. These types of errors are rare (table:error_results) and happen only when the fact space contains a fact similar to the ground-truth one. The third failure mode is due to relation prediction accuracies, which are around 75% and 92% for Top-1 and Top-3, respectively, as shown in narasimhan2018straight.
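For reference, Top-1 and Top-3 accuracy as used above can be computed with a few lines of code. The ranked relation lists and labels below are hypothetical toy data and do not reproduce the reported numbers.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of examples whose true label is among the k highest-ranked predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)


# Hypothetical ranked relation predictions for four questions (best guess first).
toy_ranked = [
    ["UsedFor", "IsA", "AtLocation"],
    ["IsA", "UsedFor", "CapableOf"],
    ["AtLocation", "IsA", "UsedFor"],
    ["CapableOf", "HasA", "IsA"],
]
toy_truth = ["UsedFor", "UsedFor", "UsedFor", "UsedFor"]

top1 = top_k_accuracy(toy_ranked, toy_truth, k=1)
top3 = top_k_accuracy(toy_ranked, toy_truth, k=3)
```

Top-3 accuracy is never lower than Top-1 accuracy, since every Top-1 hit is also a Top-3 hit.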

5 Conclusions

We developed a method for ‘reasoning’ in factual visual question answering using graph convolution nets. We showed that our proposed algorithm outperforms existing baselines by a large margin of 7%. We attribute these improvements to ‘joint reasoning about answers,’ which facilitates sharing of information before making an informed decision. Further, we achieve this high increase in performance by using only the ground truth relation and answer information, with no reliance on the ground truth fact. Currently, all the components of our model except for fact retrieval are trainable end-to-end. In the future, we plan to extend our network to incorporate this step into a unified framework.

Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, Samsung, 3M, and the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR). We thank NVIDIA for providing the GPUs used for this research. We also thank Arun Mallya and Aditya Deshpande for their help.


  • (1) J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.
  • (2) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
  • (3) S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. In ISWC/ASWC, 2007.
  • (4) H. Ben-younes, R. Cadene, M. Cord, and N. Thome. Mutan: Multimodal tucker fusion for visual question answering. In ICCV, 2017.
  • (5) J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP, 2013.
  • (6) J. Berant and P. Liang. Semantic parsing via paraphrasing. In ACL, 2014.
  • (7) A. Bordes, S. Chopra, and J. Weston. Question answering with sub-graph embeddings. In EMNLP, 2014.
  • (8) A. Bordes, N. Usunier, S. Chopra, and J. Weston. Large-scale simple question answering with memory networks. In ICLR, 2015.
  • (9) A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML, 2014.
  • (10) Q. Cai and A. Yates. Large-scale Semantic Parsing via Schema Matching and Lexicon Extension. In ACL, 2013.
  • (11) A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
  • (12) A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual Dialog. In CVPR, 2017.
  • (13) A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv:1703.06585, 2017.
  • (14) L. Dong, F. Wei, M. Zhou, and K. Xu. Question answering over freebase with multi-column convolutional neural networks. In ACL, 2015.
  • (15) A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In KDD, 2014.
  • (16) A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
  • (17) H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and Methods for Multilingual Image Question Answering. In NIPS, 2015.
  • (18) D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. In CVPR, 2018.
  • (19) A. Jabri, A. Joulin, and L. van der Maaten. Revisiting Visual Question Answering Baselines. In ECCV, 2016.
  • (20) U. Jain, S. Lazebnik, and A. G. Schwing. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering. In CVPR, 2018.
  • (21) U. Jain, Z. Zhang, and A. G. Schwing. Creativity: Generating Diverse Questions using Variational Autoencoders. In CVPR, 2017.
  • (22) A. Jiang, F. Wang, F. Porikli, and Y. Li. Compositional memory for visual question answering. arXiv:1511.05676, 2015.
  • (23) J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • (24) J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual qa. In NIPS, 2016.
  • (25) T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
  • (26) O. Kolomiyets and M.-F. Moens. A survey on question answering technology from an information retrieval perspective. In Information Sciences, 2011.
  • (27) J. Krishnamurthy and T. Kollar. Jointly learning to parse and perceive: Connecting natural language to the physical world. In ACL, 2013.
  • (28) T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, 2013.
  • (29) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. IEEE, 1998.
  • (30) Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang. Visual question generation as dual task of visual question answering. arXiv:1709.07192, 2017.
  • (31) P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Computational Linguistics, 2013.
  • (32) J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
  • (33) L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. In AAAI, 2016.
  • (34) M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.
  • (35) M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
  • (36) N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. arXiv:1603.06059, 2016.
  • (37) K. Narasimhan, A. Yala, and R. Barzilay. Improving information extraction by acquiring external evidence with reinforcement learning. In EMNLP, 2016.
  • (38) M. Narasimhan and A. G. Schwing. Straight to the facts: Learning knowledge base retrieval for factual visual question answering. In ECCV, 2018.
  • (39) J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • (40) S. Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, and M. Lapata. Transforming dependency structures to logical forms for semantic parsing. In ACL, 2016.
  • (41) M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NIPS, 2015.
  • (42) M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In ESWC, 2018.
  • (43) I. Schwartz, A. G. Schwing, and T. Hazan. High-Order Attention Models for Visual Question Answering. In NIPS, 2017.
  • (44) K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
  • (45) R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
  • (46) S. W.-t. Yih, M.-W. Chang, X. He, and J. Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL-IJCNLP, 2015.
  • (47) N. Tandon, G. de Melo, F. Suchanek, and G. Weikum. Webchild: Harvesting and organizing commonsense knowledge from the web. In WSDM, 2014.
  • (48) C. Unger, L. Bühmann, J. Lehmann, A.-C. N. Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In WWW, 2012.
  • (49) A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv:1610.02424, 2016.
  • (50) P. Wang, Q. Wu, C. Shen, A. Dick, and A. v. d. Hengel. Fvqa: Fact-based visual question answering. TPAMI, 2018.
  • (51) P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. Explicit knowledge-based reasoning for visual question answering. In IJCAI, 2017.
  • (52) X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, 2018.
  • (53) Q. Wu, C. Shen, A. van den Hengel, P. Wang, and A. Dick. Image captioning and visual question answering based on attributes and their related external knowledge. arXiv:1603.02814, 2016.
  • (54) Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
  • (55) C. Xiao, M. Dymetman, and C. Gardent. Sequence-based structured prediction for semantic parsing. In ACL, 2016.
  • (56) C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
  • (57) H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
  • (58) X. Yao and B. Van Durme. Information extraction over structured data: Question answering with Freebase. In ACL, 2014.
  • (59) Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
  • (60) L. Yu, E. Park, A. Berg, and T. Berg. Visual madlibs: Fill in the blank image generation and question answering. In ICCV, 2015.
  • (61) L. S. Zettlemoyer and M. Collins. Learning context-dependent mappings from sentences to logical form. In ACL, 2005.
  • (62) L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, 2005.
  • (63) P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and yang: Balancing and answering binary visual questions. arXiv:1511.05099, 2015.
  • (64) Y. Zhang, K. Liu, S. He, G. Ji, Z. Liu, H. Wu, and J. Zhao. Question answering over knowledge base with neural attention combining global knowledge information. arXiv:1606.00979, 2016.
  • (65) B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv:1512.02167, 2015.
  • (66) Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In CVPR, 2016.
  • (67) Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal Knowledge Base for Visual Question Answering. In CoRR, 2015.
  • (68) C. L. Zitnick, A. Agrawal, S. Antol, M. Mitchell, D. Batra, and D. Parikh. Measuring machine intelligence through visual question answering. AI Magazine, 2016.