Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering

07/13/2021 ∙ by Rajat Koner, et al.

Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Since it requires a deep semantic and linguistic understanding of the question and the ability to associate it with the various objects present in the image, it is an ambitious task that requires multi-modal reasoning from both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method performs context-driven, sequential reasoning based on the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths, which are the basis for deriving answers. We conduct an experimental study on the challenging GQA dataset, based on both manually curated and automatically generated scene graphs. Our results show that we keep up with human performance on manually curated scene graphs. Moreover, we find that Graphhopper outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs by a significant margin.







1 Introduction

Figure 1: Example of an image and the corresponding scene graph. Since the scene graph is a directed graph with typed edges, it resembles a knowledge graph and permits the application of knowledge-base completion techniques.

Visual Question Answering (VQA) is a challenging task that involves understanding and reasoning over two data modalities, i.e., images and natural language. Given an image and a free-form question about the presented scene, the task is for the algorithm to find the correct answer.

VQA has been studied from the perspective of scene and knowledge graphs [33, 6], as well as vision-language reasoning [10, 1]. To study VQA, various real-world data sets, such as the VQA data set [4, 24], have been generated. It has been argued that, in the VQA data set, many of the apparently challenging reasoning tasks can be solved by an algorithm through exploiting trivial prior knowledge, and thus by shortcuts to proper reasoning (e.g., clouds are white or doors are made of wood). To address these shortcomings, the GQA dataset [16] has been developed. Compared to other real-world datasets, GQA is more suitable for evaluating reasoning abilities since the images and questions are carefully filtered to make the data less prone to biases.

Many VQA approaches are agnostic towards the explicit relational structure of the objects in the presented scene and rely on monolithic neural network architectures that process regional features of the image separately [2, 39]. While these methods led to promising results on previous datasets, they lack explicit compositional reasoning abilities, which results in weaker performance on more challenging datasets such as GQA. Other works [34, 31, 17] perform reasoning on explicitly detected objects and the interactive semantic and spatial relationships among them. These approaches are closely related to scene graph representations [19] of an image, where detected objects are labeled as nodes and relationships between the objects are labeled as edges. In this work, we aim to combine VQA techniques with recent research advances in the area of statistical relational learning on knowledge graphs (KGs). KGs provide human-understandable, structured representations of knowledge about the real world via collections of factual statements. Inspired by multi-hop reasoning methods on KGs such as [8, 38, 12], we propose Graphhopper, a novel method that models the VQA task as a path-finding problem on scene graphs. The underlying idea can be summarized with the phrase: Learn to walk to the correct answer. More specifically, given an image, we consider a scene graph and train a reinforcement learning agent to conduct a policy-guided random walk on the scene graph until a conclusive inference path is obtained. In contrast to purely embedding-based approaches, our method provides explicit reasoning chains that lead to the derived answers. To sum up, our major contributions are as follows.

  • Graphhopper is the first VQA method that employs reinforcement learning for multi-hop reasoning on scene graphs.

  • We conduct a thorough experimental study on the challenging VQA dataset GQA to show the compositional and interpretable nature of our model.

  • To analyze the reasoning capabilities of our method, we consider manually curated (ground truth) scene graphs. This setting isolates the noise associated with the visual perception task and focuses solely on the language understanding and reasoning task. Thereby, we can show that our method achieves human-like performance.

  • Based on both the manually curated scene graphs and our own automatically generated scene graphs, we show that Graphhopper outperforms the Neural State Machine (NSM), a state-of-the-art scene graph reasoning model that operates in a setting similar to Graphhopper's.

Moreover, we are the first group to conduct experiments and publish the code on generated scene graphs for the GQA dataset (code is available at: ). The remainder of this work is organized as follows. We review related literature in the next section. Section 3 introduces the notation and describes the methodology of Graphhopper. Sections 4 and 5 detail an experimental study on the benchmark dataset GQA. Furthermore, through a rigorous study using both manually curated ground-truth and generated scene graphs, we examine the reasoning capabilities of Graphhopper. We conclude in Section 6.

2 Related Work

Visual Question Answering:

Various models have been proposed that perform VQA on both real-world [4, 16] and artificial datasets [18]. Currently, leading VQA approaches can be categorized into two different branches: first, monolithic neural networks, which perform implicit reasoning on latent representations obtained from fusing the two data modalities; second, multi-hop methods that form explicit symbolic reasoning chains on a structured representation of the data. Monolithic network architectures obtain visual features from the image either in the form of individual detected objects or by processing the whole image directly via convolutional neural networks (CNNs). The derived embeddings are usually scored against a fixed answer set along with the embedding of the question obtained from a sequence model. Moreover, co-attention mechanisms are frequently employed to couple the vision and the language models, allowing for interactions between objects from both modalities [20, 2, 5, 40, 41]. Monolithic networks are among the dominant methods on previous real-world VQA datasets such as [4]. However, they suffer from the black-box problem and possess limited reasoning capabilities with respect to complex questions that require long reasoning chains (see [7] for a detailed discussion).

Explicit reasoning methods combine the sub-symbolic representation learning paradigm with symbolic reasoning approaches over structured representations of the image. Most of the popular explicit reasoning approaches follow the idea of neural module networks (NMNs) [3] which perform a sequence of reasoning steps realized by forward passes through specialized neural networks that each correspond to predefined reasoning subtasks. Thereby, NMNs construct functional programs by dynamically assembling the modules resulting in a question-specific neural network architecture. In contrast to the monolithic neural network architectures described above, these methods contain a natural transparency mechanism via functional programs. However, while NMN-related methods (e.g., [14, 26]) exhibit good performance on synthetic datasets such as CLEVR [18], they require functional module layouts as additional supervision signals to obtain good results. Closely related to our method is the Neural State Machine (NSM) proposed by [15]. NSM’s underlying idea consists of first constructing a scene graph from an image and treating it as a state machine. Concretely, the nodes correspond to states and edges to transitions. Then, conditioned on the question, a sequence of instructions is derived that indicates how to traverse the scene graph and arrive at the answer. In contrast to NSM, we treat path-finding as a decision problem in a reinforcement learning setting. Concretely, we outline in the next section how extracting predictive paths from scene graphs can be naturally formulated in terms of a goal-oriented random walk induced by a stochastic policy that allows the approach to balance between exploration and exploitation. Moreover, our framework integrates state-of-the-art techniques from graph representation learning and NLP. This paper only considers basic policy gradient methods, but more sophisticated reinforcement learning techniques will be employed in future works.

Statistical Relational Learning:

Machine learning methods for KG reasoning aim at exploiting statistical regularities in observed connectivity patterns. These methods are studied under the umbrella of statistical relational learning (SRL) [27]. In recent years, KG embeddings have become the dominant approach in SRL. The underlying idea is that graph features that explain the connectivity pattern of KGs can be encoded in low-dimensional vector spaces. In the embedding spaces, the interactions among the embeddings for entities and relations can be efficiently modeled to produce scores that predict the validity of a triple. Despite achieving good results in KG reasoning tasks, most embedding-based methods have problems capturing the compositionality expressed by long reasoning chains. This often limits their applicability in complex reasoning tasks. Recently, multi-hop reasoning methods such as MINERVA [8] and DeepPath [38] were proposed. Both methods are based on the idea that a reinforcement learning agent is trained to perform a policy-guided random walk until the answer entity to a query is reached. Thereby, the path-finding problem of the agent can be modeled in terms of a sequential decision-making task framed as a Markov decision process (MDP). The method that we propose in this work follows a similar philosophy, in the sense that we train an RL agent to navigate on a scene graph to the correct answer node. However, a conceptual difference is that the agents in MINERVA and DeepPath perform walks on large-scale knowledge graphs exploiting repeating statistical patterns. Thereby, the policies implicitly incorporate approximate rules. In addition, instead of processing free-form questions, the query in the KG reasoning setting is structured as a pair of symbolic entities. That is why we propose a wide range of modifications to adjust our method to the challenging VQA setting.

3 Method

The task of VQA is framed as a scene graph traversal problem. Starting from a hub node that is connected to all other nodes, an agent sequentially samples transitions to neighboring nodes on the scene graph until the node corresponding to the answer is reached. In this way, by adding transitions to the current path, the reasoning chain is successively extended. Before describing the decision problem of the agent, we introduce the notation that we use throughout this work.


A scene graph is a directed multigraph where each node corresponds to a scene entity, which is either an object associated with a bounding box or an attribute of an object. Each scene entity comes with a type that corresponds to the predicted object or attribute label. Typed edges specify how scene entities are related to each other. More formally, let E denote the set of scene entities and let R denote the set of binary relations. Then a scene graph SG ⊆ E × R × E is a collection of ordered triples (s, p, o) – subject, predicate, and object. For example, as shown in Figure 1, the triple (motorcycle-1, has_part, tire-1) indicates that both a motorcycle (subject) and a tire (object) are detected in the image. The predicate has_part indicates the relation between the entities. Moreover, we denote with p⁻¹ the inverse relation corresponding to the predicate p. For the remainder of this work, we impose completeness with respect to inverse relations in the sense that for every (s, p, o) ∈ SG it is implied that (o, p⁻¹, s) ∈ SG.
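As a minimal sketch of this representation, a scene graph can be stored as a set of (subject, predicate, object) triples and closed under inverse relations so that every edge can be traversed in both directions. Entity and relation names below are illustrative, not taken from the GQA vocabulary.

```python
# Sketch: a scene graph as a triple set, closed under inverse relations.
def close_under_inverses(triples):
    """Extend a triple set with (o, p^-1, s) for every (s, p, o)."""
    closed = set(triples)
    for s, p, o in triples:
        closed.add((o, p + "^-1", s))
    return closed

scene_graph = close_under_inverses({
    ("motorcycle-1", "has_part", "tire-1"),
    ("tire-1", "has_attribute", "black"),
})
```

With this closure, an agent located at tire-1 can walk back to motorcycle-1 via the inverse edge has_part^-1.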

Figure 2: The architecture of our scene graph reasoning module.


The state space of the agent is given by S = E × Q, where E denotes the nodes of a scene graph and Q denotes the set of all questions. The state at time t consists of the entity e_t at which the agent is currently located and the question q. Thus, a state S_t ∈ S for time t is represented by S_t = (e_t, q). The set of available actions from a state S_t is denoted by A_{S_t}. It contains all outgoing edges from the node e_t together with their corresponding object nodes. More formally, A_{S_t} = {(r, e) ∈ R × E : (e_t, r, e) is a triple of the scene graph}. Moreover, we denote with A_t the action that the agent performed at time t. We include self-loops for each node that produce a NO_OP label. These self-loops allow the agent to remain at its current location if it has reached the answer node. Furthermore, the introduction of inverse relations allows the agent to transition freely in either direction between two nodes.

The environment evolves deterministically by updating the state according to the previous action. Formally, the transition function at time t is given by δ(S_t, A_t) := S_{t+1} with S_t = (e_t, q) and S_{t+1} = (e_{t+1}, q).

Auxiliary Nodes: In addition to the standard entity and relation nodes present in a scene graph, we introduce a few auxiliary nodes (e.g., a hub node). The rationale for including auxiliary nodes is that they facilitate the walk of the agent or help to frame the QA task as a goal-oriented walk on the scene graph. These additional nodes are included during the run-time graph traversal but are ignored at compile time, e.g., when computing node embeddings. For example, we add a hub node (hub) to every scene graph, which is connected to all other nodes. The agent then starts the scene graph traversal from the hub with global connectivity. Furthermore, for binary questions, we add YES and NO nodes to the scene entities that correspond to the final location of the agent. The agent can then transition to either the YES or the NO node.
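The environment described above can be sketched as follows. This is a hedged illustration with our own class and edge names (not the paper's code): states are (node, question) pairs, actions are outgoing edges plus a NO_OP self-loop, and the hub node is wired to every other node at traversal time. Inverse-relation closure is omitted for brevity.

```python
# Sketch of the deterministic scene-graph environment with a hub node
# and NO_OP self-loops (names are illustrative assumptions).
class SceneGraphEnv:
    def __init__(self, triples, question):
        self.question = question
        self.edges = {}  # node -> list of (relation, target) actions
        nodes = set()
        for s, p, o in triples:
            self.edges.setdefault(s, []).append((p, o))
            nodes.update((s, o))
        # auxiliary hub node connected to all other nodes
        self.edges["hub"] = [("hub_edge", n) for n in sorted(nodes)]
        # NO_OP self-loops let the agent stay on the answer node
        for n in nodes:
            self.edges.setdefault(n, []).append(("NO_OP", n))
        self.state = ("hub", question)  # traversal starts at the hub

    def actions(self):
        return self.edges[self.state[0]]

    def step(self, action):
        relation, target = action
        self.state = (target, self.question)  # deterministic transition
        return self.state

env = SceneGraphEnv({("motorcycle-1", "has_part", "tire-1")}, "toy question")
```

A rollout then alternates between reading `env.actions()` and calling `env.step(...)` with the sampled action.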

Question and Scene Graph Processing

We initialize the words in q with GloVe embeddings [29] of dimension 300. Similarly, we initialize the entities and relations in the scene graph with the embeddings of their type labels. In the scene graph, the node embeddings are passed through a multi-layer graph attention network (GAT) [36]. Extending the idea of graph convolutional networks [22] with a self-attention mechanism, GATs mimic the convolution operator on regular grids: an entity embedding is formed by aggregating the node features of its neighbors. Relations and inverse relations between nodes allow context to flow in both directions through the GAT. Thus, the resulting embeddings are context-aware, which makes nodes with the same type but different graph neighborhoods distinguishable. To produce an embedding for the question q, we first apply a Transformer [35], followed by a mean pooling operation.

Finally, since we added auxiliary YES and NO nodes to the scene graph for binary questions, we train a feedforward neural network to classify questions as either query-type (i.e., questions that query for an object in the depicted scene) or binary. This network consists of two fully connected layers with a ReLU activation on the intermediate output. We find that it is easy to distinguish between query and binary questions (e.g., query questions usually begin with What, Which, How, etc., whereas binary questions usually begin with Do, Is, etc.). Since our classifier achieves 99.99% accuracy, we ignore the error in question classification in the following discussions.
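The paper trains a two-layer feedforward classifier for this; as a rough illustration of why the split is so easy, a simple first-word rule already separates typical questions. The word lists below are illustrative assumptions, not the trained classifier.

```python
# Heuristic sketch of query-vs-binary classification via the first word.
QUERY_WORDS = {"what", "which", "how", "where", "who"}
BINARY_WORDS = {"do", "does", "did", "is", "are", "was", "were"}

def question_type(question):
    first = question.strip().lower().split()[0]
    if first in BINARY_WORDS:
        return "binary"
    if first in QUERY_WORDS:
        return "query"
    return "unknown"
```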


We denote the agent’s history up to time t with the tuple H_t = (H_{t−1}, A_{t−1}) for t ≥ 1, along with H_0 = hub for t = 0. The history is encoded via a multilayered LSTM [13]

h_t = LSTM(h_{t−1}, a_{t−1}),    (1)

where a_{t−1} = [r_{t−1}; e_t] corresponds to the embedding of the previous action, with r_{t−1} and e_t denoting the embeddings of the edge and the target node, respectively. The history-dependent action distribution is given by

d_t = softmax(A_t (W_2 ReLU(W_1 [h_t; q]))),    (2)

where the rows of A_t contain the latent representations of all admissible actions. Moreover, q encodes the question. The action is drawn from the categorical distribution d_t. Equations (1) and (2) induce a stochastic policy π_θ, where θ denotes the set of trainable parameters.
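As a hedged numerical sketch of the action distribution of Equation (2) (the matrix shapes and the tanh stand-in for the policy MLP are our assumptions), admissible-action embeddings are scored against a function of the history encoding and the question embedding, then normalized with a softmax:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

def action_distribution(h_t, q, A_t, W):
    """h_t: history encoding, q: question embedding,
    A_t: rows = embeddings of admissible actions, W: trainable matrix."""
    context = np.tanh(W @ np.concatenate([h_t, q]))  # stand-in for the MLP
    return softmax(A_t @ context)

rng = np.random.default_rng(0)
h_t, q = rng.normal(size=8), rng.normal(size=8)
A_t = rng.normal(size=(5, 8))   # five admissible actions
W = rng.normal(size=(8, 16))
probs = action_distribution(h_t, q, A_t, W)
```

Sampling an action then amounts to drawing an index from the categorical distribution `probs`.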

Rewards and Optimization

After sampling T transitions, a terminal reward is assigned according to

R = 1 if the agent ends on the node that corresponds to the correct answer, and R = 0 otherwise.    (3)

We employ REINFORCE [37] to maximize the expected rewards. Thus, the agent’s maximization problem is given by

argmax_θ  E_{q∼Q} E_{A_1,…,A_T∼π_θ} [R | q],    (4)

where Q denotes the set of training questions. During training, the first expectation in Equation (4) is substituted with the empirical average over the training set. The second expectation is approximated by the empirical average over multiple rollouts. We also employ a moving average baseline to reduce the variance. Further, we use entropy regularization to enforce exploration. During inference, we do not sample paths but perform a beam search of width 20 based on the transition probabilities given by Equation (2).
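A minimal sketch of this training signal, assuming each rollout is summarized by one scalar log-probability, reward, and entropy (the exact bookkeeping is ours, not the paper's implementation): the loss combines the advantage-weighted log-probabilities with an entropy bonus.

```python
import numpy as np

def reinforce_loss(log_probs, rewards, baseline, entropies, beta):
    """REINFORCE with a baseline and entropy regularization.
    log_probs, rewards, entropies: one entry per rollout;
    beta: entropy coefficient."""
    advantages = rewards - baseline          # variance reduction
    policy_term = -(advantages * log_probs).mean()
    entropy_term = -beta * entropies.mean()  # encourages exploration
    return policy_term + entropy_term

log_probs = np.array([-1.2, -0.3, -2.0])
rewards = np.array([1.0, 1.0, 0.0])          # terminal 0/1 rewards
baseline = rewards.mean()                    # stand-in for a moving average
entropies = np.array([0.5, 0.4, 0.6])
loss = reinforce_loss(log_probs, rewards, baseline, entropies, beta=0.2)
```

In an autodiff framework, minimizing this loss ascends the gradient of the expected-reward objective of Equation (4).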


Additional details on the model, the training and the inference procedure along with sketches of the algorithms, and a complexity analysis can be found in the supplementary material.
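As a hedged illustration of the inference procedure, the following performs a width-limited beam search over graph transitions; the toy graph and scoring function are stand-ins for the learned transition probabilities of Equation (2).

```python
# Sketch of beam search over scene-graph transitions.
def beam_search(start, transitions, score, steps, width):
    """transitions: node -> list of (relation, target) actions;
    score(path, action) -> log-probability of taking `action`."""
    beams = [([start], 0.0)]  # (path of visited nodes, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for path, logp in beams:
            for action in transitions.get(path[-1], []):
                candidates.append((path + [action[1]],
                                   logp + score(path, action)))
        # keep the `width` highest-scoring partial paths
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams

trans = {"hub": [("edge", "a"), ("edge", "b")],
         "a": [("NO_OP", "a")], "b": [("NO_OP", "b")]}
score = lambda path, action: 0.0 if action[1] == "a" else -1.0
beams = beam_search("hub", trans, score, steps=2, width=2)
```

With the paper's setup, `width` would be 20 and `score` would come from the trained policy.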

4 Dataset and Experimental Setup

In this section we introduce the dataset and detail the experimental protocol.

4.1 Dataset

The GQA dataset [16] has been introduced with the goal of addressing key shortcomings of previous VQA datasets, such as CLEVR [18] or the VQA dataset [4]. GQA is more suitable for evaluating the reasoning and compositional abilities of a model in a realistic setting. It contains 113K images and around 1.2M questions, split roughly into training, validation, and testing sets. The overall vocabulary consists of 3097 words, including 1702 object classes, 310 relationship types, and 610 object attributes.
Due to the large number of objects and relationships present in GQA, we used a pruned version of the dataset (see Section 5) for our generated scene graphs. In this work, we conducted two primary experiments. First, we report results on the manually curated scene graphs provided in the GQA dataset. In this setting, the true reasoning and language-understanding capabilities of our model can be analyzed. Afterward, we evaluate the performance of our model with generated scene graphs on the pruned GQA dataset, which shows the performance of our model on noisy, generated data. We used the state-of-the-art Relation Transformer Network (RTN) [23] for scene graph generation and DetectoRS [30] for object detection. We conducted all experiments on the “test-dev” split of GQA.

Question Types:

The questions are designed to evaluate reasoning abilities such as visual verification, relational reasoning, spatial reasoning, comparison, and logical reasoning. These questions can be categorized according to either structural or semantic criteria. An overview of the different question types is given in the supplementary material (see Table 4).

4.2 Experimental Setup

Scene Graph Reasoning:

Regarding the model parameters, we apply 300-dimensional GloVe embeddings to both the questions and the graphs (i.e., edges and nodes). Moreover, we employ a two-layer GAT [36] model. The dropout [32] probability of each layer is set to 0.1. The first layer has eight attention heads, each with eight latent features, which are concatenated to form the output features of that layer. The output layer has eight attention heads with mean aggregation, so that the output also has 300-dimensional features. We additionally apply dropout to the attention coefficients at each layer, which essentially means that each node is exposed to a stochastically sampled neighborhood during training. Moreover, we employ a two-layer Transformer [35] decoder model. The model dimension is set to 300, and the key and query dimensions are both set to 64, with dropout. The LSTM of the policy network consists of a uni-directional layer with hidden size 300. Finally, the agent performs a fixed number of transitions. In question answering, most questions concern one subject to be explored within one reasoning path originating from the start node. Hence, we set the maximum number of steps to 4, without resetting. By contrast, binary questions use 8 steps with a reset frequency of 4. In other words, the agent is returned to the hub node after the fourth step.

Training Graphhopper:

In terms of the training procedure, the GAT, the Transformer, and the policy network are initialized with Glorot initialization [11]. We train our model with data from the val_balanced_questions tier. We use a batch size of 64 and sample batches of questions along with their associated graphs. We collect 20 stochastic rollouts for each question, performed in a vectorized form to exploit parallel computation. For each batch, we collect the rewards once a complete forward pass is done. Then the gradients are estimated from the rewards and applied to update the weights. We employ the Adam optimizer [21] with a fixed learning rate for all trainable weights. The coefficient for the action entropy, which balances exploration and exploitation, starts at 0.2 and decays exponentially at each step with a factor of 0.99.
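The entropy-coefficient schedule described above can be sketched directly (function name is ours): start at 0.2 and decay by a factor of 0.99 per step.

```python
# Sketch of the exponential entropy-coefficient decay.
def entropy_coefficient(step, start=0.2, factor=0.99):
    return start * factor ** step
```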

Next to other standard Python libraries, we mainly employed PyTorch. All experiments were conducted on a machine with one NVIDIA RTX 2080 Ti GPU and 64 GB RAM. Training the scene graph reasoner of Graphhopper for 40 epochs on GQA takes around 10 hours; testing takes about 1 hour.

4.3 Performance Metrics

Along with the accuracy (i.e., Hits@1) on open questions (“Open”), binary questions (yes/no) (“Binary”), and the overall accuracy (“Accuracy”), we also report the additional metrics “Consistency” (answers should not contradict themselves), “Validity” (answers are in the range of the question; e.g., red is a valid answer when asked for the color of an object), and “Plausibility” (answers should be reasonable; e.g., red is a plausible color for an apple, blue is not), as proposed in [16].

5 Results and Discussion

As outlined before, VQA is a challenging task, and there is still a significant performance gap between state-of-the-art VQA methods and human performance on challenging, real-world datasets such as GQA (see [16]). Similar to other existing methods, our architecture involves multiple components, and it is important to be able to analyze the performance of the different modules and processing steps in isolation. Therefore, we first present the results of our experiments on manually curated, ground-truth scene graphs provided in the GQA dataset and compare the performance of Graphhopper against NSM and humans. This setting allows us to isolate the noise from the visual perception component and quantify our method's reasoning capabilities. Subsequently, we present the results with our own generated scene graphs.

In addition, we observed that the inclusion of auxiliary nodes helps the agent achieve good performance. Starting from the hub node performs better than starting from a random node, as the hub facilitates easier forward movement and backtracking. For binary questions, instead of using YES and NO nodes, we experimented with a setting in which the agent’s path was processed by another classifier (e.g., a logistic regression) whose classification logits were assigned as rewards. However, this led to inferior results, most likely due to the absence of a weight-sharing mechanism and the noisy reward signal produced by the classifier. These observations support our assumptions about the role of the auxiliary nodes in the scene graph.

Reproducing NSM:

[17] proposed the state-of-the-art NSM method for VQA. NSM is conceptually the most similar method to ours, as it also exploits scene graph reasoning for VQA; we therefore consider NSM our baseline for comparison. However, its approach to reasoning differs from ours. To compare the reasoning ability of our method on the same generated scene graphs, we reproduced NSM, as its code is not open-sourced. We used the available parameters from [17] and the implementation from [9].

Method Binary Open Consistency Validity Plausibility Accuracy
Human [16] 91.2 87.4 98.4 98.9 97.2 89.3
NSM [17] 51.03 18.79 81.36 83.69 79.12 34.5
Graphhopper 92.18 92.40 91.92 93.68 93.13 92.30
Table 1: A comparison of Graphhopper with human performance and NSM based on manually curated scene graphs.

5.1 Results on Manually Curated Scene Graphs

In this section, we report on an experimental study with Graphhopper on the manually curated scene graphs provided along with the GQA dataset. Table 1 shows the performance of Graphhopper and compares it with the human performance reported in [16] and with the performance of NSM on the same underlying manually curated scene graphs. We find that Graphhopper strictly outperforms NSM with respect to all performance measures. In particular, on the open questions, the performance gap is significant. Moreover, Graphhopper also slightly outperforms humans with respect to the accuracy on both types of questions. On the other hand, concerning the supplementary performance measures consistency, validity, and plausibility, Graphhopper is outperformed by humans but nevertheless consistently reaches high values. Overall, these results can be seen as a testament to the reasoning capabilities of Graphhopper and establish an upper bound on its performance.

5.2 Results on Automatically Generated Scene Graphs

The process of generating a graph representation of visual data is a costly and complex procedure. Although scene graph generation is not the main focus of this work, creating good scene graphs for GQA constituted one of the major challenges, due to the following facts:

  • There is no open source code for GQA scene graph generation or object detection.

  • A large number of instances and an uneven class distribution in GQA lead to a significant drop in accuracy compared to existing scene graph datasets (see [24]).

  • There is a lack of attribute prediction models in modern object detection frameworks.

In this work, we address all of these challenges, as our model’s performance directly depends on the quality of the scene graph. We will also open-source our code base for transparency and to accelerate the development of scene graph-based reasoning for VQA.

Generation of Scene Graph:

To address these problems, we first choose two state-of-the-art networks: RTN [23] for scene graph generation and DetectoRS [30] for object detection. The Transformer-based [35] architecture of RTN and its contextual scene graph embeddings are most closely related to our architecture and to our planned future extensions. To keep Graphhopper generic with respect to the scene graph generator, we do not use the contextualized embeddings from RTN; instead, we rely on the GAT for contextualization.

Pruning of GQA:

GQA has more than 6 times the number of relationships of Visual Genome [24], the most widely used scene graph generation dataset, and more than 18 times the number of objects of the most common object detection dataset, COCO. Also, the class distribution is highly skewed, which causes a significant drop in accuracy for both the object detection and the scene graph generation task. To efficiently prune the number of instances, we take the first 800 object classes, 170 relationships, and 200 attributes based on their frequency of occurrence in the training questions and answers. This pruning allows us to remove a large fraction of the words while still covering the large majority of the combined answers in the training set.
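The frequency-based pruning described above can be sketched as keeping the k most frequent labels, counted over the training questions and answers. The counting corpus below is a toy stand-in.

```python
from collections import Counter

def top_k_labels(occurrences, k):
    """Return the k most frequent labels in descending count order."""
    return [label for label, _ in Counter(occurrences).most_common(k)]

corpus = ["car", "tire", "car", "road", "car", "tire", "sky"]
kept = top_k_labels(corpus, 2)
```

In the paper's setting, this would be applied separately with k = 800 for object classes, 170 for relationships, and 200 for attributes.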

Attribute prediction:

One shortcoming of existing scene graph generation and object detection networks is that they do not predict the attributes (e.g., the color or size) of a detected object. Therefore, we incorporate attribute prediction for answering questions on GQA. The contextualized object embedding from RTN [23] is used for attribute prediction as

a = σ(W_1 c + W_2 p),

where W_1 and W_2 are the weight matrices of a linear layer, c is the contextual embedding of an object, p is the probability distribution over all object classes, and a is the probability distribution over the attributes. σ denotes the sigmoid function.
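A small numerical sketch of such an attribute head (the layer shapes and the use of two separate weight matrices are our assumptions): attributes are scored from the contextual object embedding and the object-class distribution, with a sigmoid so that multiple attributes can be active at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_attributes(W1, W2, obj_embedding, obj_probs):
    """Multi-label attribute probabilities from a linear layer + sigmoid."""
    return sigmoid(W1 @ obj_embedding + W2 @ obj_probs)

rng = np.random.default_rng(1)
c = rng.normal(size=6)          # contextual object embedding (toy size)
p = np.array([0.7, 0.2, 0.1])   # distribution over object classes
W1 = rng.normal(size=(4, 6))    # four attribute classes
W2 = rng.normal(size=(4, 3))
attr_probs = predict_attributes(W1, W2, c, p)
```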

We trained both the object detector and the scene graph generator on the pruned version of GQA with their respective default parameters after preprocessing. This substantially increases the coverage of the instances (e.g., objects, attributes, relationships) that represent answers to the training questions.

Method Binary Open Consistency Validity Plausibility Accuracy
NSM [17] 51.88 19.83 82.01 86.28 81.75 35.34
Graphhopper 69.48 44.69 83.64 89.42 85.13 56.69
Graphhopper (pr) 85.84 77.27 92.98 92.26 89.50 81.41
Table 2: A comparison of our method with NSM, based on generated scene graphs. Graphhopper (pr) indicates that we employed predicted relations from RTN [23].
(a) Question: Is the color of the number the same as that of the wristband?
Answer: No.
(b) Question: What is the name of the appliance that is not small?
Answer: Refrigerator.
(c) Question: Do both the pepper and the vegetable to the right of the ice cube have green color?
Answer: Yes.
Figure 3: Three example questions with the corresponding images and reasoning paths.

[Experiments on manually curated scene graphs]

[Experiments on ground-truth objects with relations predicted by RTN [23]]

[Experiments on scene graphs generated using the DetectoRS [30] object detector and RTN [23] as the scene graph generator]

Figure 4: Comparison of the performance of our model in various scene graph generation settings: (left) accuracy across the semantic instances (“Attribute”, “Global”, “Relation”, etc.) required to answer a question, (middle) accuracy on the different question categories (“Choose”, “Logical”, “Verify”, etc.), and (right) accuracy by the minimum number of steps needed to reach the answer node.

Table 2 shows the performance of Graphhopper in two settings: first, with fully generated graphs, where we predict the classes, attributes, and relationships using our own pipeline; second, where we only use the relationships predicted by RTN [23] (with ground-truth objects and attributes). We find that Graphhopper consistently outperforms NSM [17] on the generated graphs. Moreover, in the “pr” (predicted relations) setting, it achieves an even higher score, as the graphs do not contain any mispredictions from the object detector. These encouraging results show superior reasoning abilities both on fully generated graphs and on generated relationships between objects.

5.3 Discussion on the Reasoning Ability

To further analyze the reasoning abilities of Graphhopper, Figure 4 disentangles the results according to different question types: 5 semantic types (left) and 5 structural types (middle). Moreover, we report the performance of Graphhopper according to the length of the reasoning path (right) (see the supplementary material for additional information). We show the performance of Graphhopper separately for each of the three scene graph settings considered in this work: the first setting uses manually curated scene graphs and depicts the actual performance in an ideal environment; the second setting uses only the predicted relationships between objects, i.e., Graphhopper paired with a scene graph generator; the third setting combines the object detector, the scene graph generator, and Graphhopper. First and foremost, we find that Graphhopper consistently achieves high accuracy on all types of questions in every setting. Moreover, we find that the performance of Graphhopper does not suffer when answering a question requires many reasoning steps. We conjecture that, although high-complexity questions are harder to answer, proper contextualization of the embeddings (e.g., via the GAT and the Transformer) lets the agent extract the specific information that identifies the correct target node. The good performance on these high-complexity questions can be seen as evidence that Graphhopper efficiently translates the question into transitions on the scene graph, hopping until the correct answer is reached.

Examples of Reasoning Paths:

Figure 3 shows three examples of scene graph traversals of Graphhopper that lead to the correct answer. One can see in these examples that the sequential reasoning process over explicit scene graph entities makes the reasoning process more comprehensible. In the case of wrong predictions, the extracted path may offer insights into the mechanics of Graphhopper and facilitate debugging.

6 Conclusion

We have proposed Graphhopper, a novel method for visual question answering that integrates existing KG reasoning, computer vision, and natural language processing techniques. Concretely, an agent is trained to extract conclusive reasoning paths from scene graphs. To analyze the reasoning abilities of our method, we conducted a rigorous experimental study on both manually curated and generated scene graphs. Based on the manually curated scene graphs, we showed that Graphhopper reaches human performance. Moreover, we find that, on our own automatically generated scene graphs, Graphhopper outperforms another state-of-the-art scene graph reasoning model with respect to all considered performance metrics. In future work, we plan to combine scene graphs with common-sense knowledge graphs to further enhance the reasoning abilities of Graphhopper.


  • [1] E. Abbasnejad, D. Teney, A. Parvaneh, J. Shi, and A. v. d. Hengel (2020) Counterfactual vision and language learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10044–10054. Cited by: §1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §1, §2.
  • [3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48. Cited by: §2.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §1, §2, §4.1.
  • [5] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome (2019) Murel: multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1989–1998. Cited by: §2.
  • [6] L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S. Chang (2019) Counterfactual critic multi-agent training for scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4613–4623. Cited by: §1.
  • [7] W. Chen, Z. Gan, L. Li, Y. Cheng, W. Wang, and J. Liu (2019) Meta module network for compositional visual reasoning. arXiv preprint arXiv:1910.03230. Cited by: §2.
  • [8] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: §1, §2.
  • [9] C. Eyzaguirre (2019) NSM. GitHub repository. Cited by: §5.
  • [10] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: §1.
  • [11] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §4.2.
  • [12] M. Hildebrandt, J. A. Q. Serna, Y. Ma, M. Ringsquandl, M. Joblin, and V. Tresp (2020) Reasoning on knowledge graphs with debate dynamics. arXiv preprint arXiv:2001.00461. Cited by: §1.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.
  • [14] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813. Cited by: §2.
  • [15] D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §2.
  • [16] D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. arXiv preprint arXiv:1902.09506. Cited by: §1, §2, §4.1, §4.3, §5.1, Table 1, §5.
  • [17] D. Hudson and C. D. Manning (2019) Learning by abstraction: the neural state machine. In Advances in Neural Information Processing Systems, pp. 5901–5914. Cited by: §1, §5, §5.2, Table 1, Table 2.
  • [18] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §2, §2, §4.1.
  • [19] J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1.
  • [20] J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574. Cited by: §2.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §0.A.1, §4.2.
  • [22] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.
  • [23] R. Koner, P. Sinhamahapatra, and V. Tresp (2020) Relation transformer network. arXiv preprint arXiv:2004.06193. Cited by: §4.1, Figure 4, §5.2, §5.2, §5.2, Table 2.
  • [24] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, 2nd item, §5.2.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.2.
  • [26] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584. Cited by: §2.
  • [27] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2015) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §2.
  • [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [29] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: §3.
  • [30] S. Qiao, L. Chen, and A. Yuille (2020) DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334. Cited by: §4.1, Figure 4, §5.2.
  • [31] J. Shi, H. Zhang, and J. Li (2019) Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376–8384. Cited by: §1.
  • [32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.2.
  • [33] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang (2020) Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3716–3725. Cited by: §1.
  • [34] D. Teney, L. Liu, and A. van Den Hengel (2017) Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3, §4.2, §5.2.
  • [36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3, §4.2.
  • [37] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §0.A.1, §3.
  • [38] W. Xiong, T. Hoang, and W. Y. Wang (2017-09) DeepPath: a reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. Cited by: §1, §2.
  • [39] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §1.
  • [40] Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 1821–1830. Cited by: §2.
  • [41] C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma (2017) Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1291–1300. Cited by: §2.

Appendix 0.A Details on Model Training and Inference

0.a.1 Training Details

In order to optimize the training objective given by Equation (4), we use REINFORCE [37] to obtain the gradient approximation

∇_θ J(θ) ≈ Σ_t γ^t R ∇_θ log π_θ(a_t | s_t),

where γ is the discount factor for the reward. The gradients of the weights are aggregated over multiple rollouts. To reduce the variance, we adopt a moving-average baseline function b, which approximates the value of a state. We could have employed more sophisticated methods such as an advantage network or an actor-critic algorithm; however, we find that the current baseline works sufficiently well. Formally, the baseline consists of a non-trainable variable b and a decay hyperparameter λ, and it is updated via b ← λb + (1 − λ)R at each optimization step. Another technique that affects the training speed is reward normalization: the accumulated rewards at each time step of each rollout are collected and normalized after subtracting the baseline value.

We introduce a regularization term on the entropy of the probability distribution produced by the policy network π_θ, which encourages the agent to explore the SG. The regularization strength is controlled by a hyperparameter, to which we apply exponential decay during training so that it converges to zero.

Moreover, we use the chain rule to backpropagate the gradients into the parameters of the graph encoder (GAT) and the question encoder (Transformer). The weight updates can be performed via gradient ascent or more advanced optimization methods such as Adam [21].
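The update described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, array shapes, and hyperparameter values (discount factor, baseline decay, entropy weight) are illustrative assumptions.

```python
import numpy as np


def reinforce_update(log_probs, entropies, rewards, baseline,
                     gamma=0.95, baseline_decay=0.9, entropy_weight=0.01):
    """Compute the (negative) REINFORCE objective and the updated baseline.

    log_probs: (N, T) array, log pi(a_t | s_t) of the sampled actions over N rollouts
    entropies: (N, T) array, policy entropy at each step (exploration bonus)
    rewards:   (N,)   array, terminal reward of each rollout
    baseline:  float, non-trainable moving-average baseline b
    """
    T = log_probs.shape[1]
    discounts = gamma ** np.arange(T)          # gamma^t per time step
    returns = rewards[:, None] * discounts     # discounted terminal reward
    advantage = returns - baseline             # subtract baseline (variance reduction)
    advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)  # normalize
    # Policy-gradient loss with entropy regularization.
    loss = -(advantage * log_probs).mean() - entropy_weight * entropies.mean()
    # Moving-average baseline update: b <- lambda*b + (1 - lambda)*mean reward.
    new_baseline = baseline_decay * baseline + (1 - baseline_decay) * rewards.mean()
    return loss, new_baseline
```

In a full training loop, `loss` would be minimized with an optimizer such as Adam, with gradients flowing through `log_probs` into the policy network, the GAT, and the Transformer.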

Input: Question Q, scene graph SG
Model: Policy network π_θ, baseline b
for epoch = 1, …, N do                      // Loop over epochs
      Initialize Q and SG with GloVe embeddings
      Q ← QuestionEncoder(Q)                // Update the question with the question encoder
      SG ← GraphEncoder(SG)                 // Update the SG with the graph encoder
      B ← ∅                                 // Initialize the trajectory buffer
      for n = 1, …, N_rollouts do           // Loop over samples
            τ ← ()                          // Initialize the trajectory
            e_0 ← hub                       // Initialize the start position
            a_0 ← dummy                     // Initialize the dummy start action
            for t = 1, …, T do              // Loop over time steps
                  if restart then           // Restart and prompt the agent to the hub node
                                            // so that the agent is aware of its own action
                        e_t ← hub           // Set the next node to the hub node
                        a_t ← dummy         // Set the next action to the dummy return action
                  end if
                  Sample an action (r_t, e_t) from π_θ; τ.append(r_t, e_t)  // Extend the trajectory
                  Move the agent to the next entity
            end for
            B.append(τ)                     // Collect the trajectory
      end for
      Gather rewards                        // Gather rewards
      Approximate gradients                 // Approximate gradients
      Update the policy network π_θ         // Update the policy network
      Update the baseline function b        // Update the baseline function
end for
Algorithm 1 Training regime

0.a.2 Inference

Beam search is used to infer the answer to a given question. Our inference approach evaluates how likely specific paths are among all possible paths of a fixed length. More specifically, given an input question, the agent starts at the hub node. At each time step, the agent scores the permissible next actions using the learned policy; the score of an action corresponds to the transition probability from the current node to a target node. Next, we keep the top-k paths (where k is the beam width) among all possible transitions and move the agent to the corresponding target nodes. This computation is performed iteratively until the maximum number of transitions is reached. In the end, we obtain multiple rollouts ranked by their path probabilities, and the target node (i.e., the last node) of the top-ranked path is regarded as the answer candidate. Unlike Monte Carlo sampling, which does not consider path probabilities, beam search yields better answer candidates, as it always retains the best choices within the search region. The inference procedure is summarized in Algorithm 2.
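The beam search described above can be sketched as follows. This is a simplified, self-contained illustration: the static transition-probability table and all names are hypothetical stand-ins for the scores produced by the learned policy network.

```python
def beam_search(start, transition_probs, max_steps, beam_width):
    """Keep the top-k paths by cumulative probability at every step.

    transition_probs: dict mapping node -> list of (next_node, prob) pairs,
        a stand-in for the learned policy's action scores.
    Returns the most probable path; its last node is the answer candidate.
    """
    beams = [([start], 1.0)]                       # (path, cumulative probability)
    for _ in range(max_steps):
        candidates = [
            (path + [nxt], p * q)                  # extend every beam by every action
            for path, p in beams
            for nxt, q in transition_probs.get(path[-1], [])
        ]
        if not candidates:                         # no outgoing edges anywhere
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]            # keep the top-k paths
    return beams[0][0]


# Toy scene graph rooted at the hub node.
graph = {
    "hub": [("man", 0.7), ("dog", 0.3)],
    "man": [("shirt", 0.9), ("hat", 0.1)],
    "dog": [("leash", 1.0)],
}
```

For the toy graph above, `beam_search("hub", graph, max_steps=2, beam_width=2)` returns `["hub", "man", "shirt"]` (cumulative probability 0.63), whose last node would be taken as the answer.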

Inference Complexity

The inference of our method is computationally efficient. Unlike other methods that need to iterate through each candidate answer for a final prediction, we only need to run inference once to obtain the score of each answer. Let d denote the embedding dimension of the words and entities. Analytically, the embedding stage has asymptotic complexity linear in the number of words and entities. For the GAT, the implementations of a single attention head and of multi-head attention are similar; in particular, they have the same asymptotic time complexity. The question encoding is efficient, as it runs only once per question and is reused arbitrarily many times during the random walks; moreover, questions are usually short (fewer than 30 words). Finally, during random-walk sampling, the agent's per-step cost is dominated by scoring the admissible transitions, and the inference time depends largely on the path length.

Input: Question Q, scene graph SG
Output: Answer
Initialize Q and SG with GloVe embeddings
Q ← QuestionEncoder(Q)                      // Update the question with the question encoder
SG ← GraphEncoder(SG)                       // Update the SG with the graph encoder
P ← ∅                                       // Initialize the probability register
τ ← ()                                      // Initialize the trajectory
e_0 ← hub                                   // Initialize the start position
a_0 ← dummy                                 // Initialize the dummy start action
for t = 1, …, T do                          // Loop over time steps
      for n = 1, …, k do                    // Loop over rollouts
            if restart then                 // Restart and prompt the agent to the hub node
                                            // so that the agent is aware of its own action
                  e_t ← hub                 // Set the next node to the hub node
                  a_t ← dummy               // Set the next action to the dummy return action
            end if
            Forward pass through the policy network to generate candidate actions
            along with their probabilities; τ.append(candidates)  // Extend the trajectory
            P.append(probabilities)         // Store the corresponding probabilities
      end for
      Filter the indices of the top-k probabilities from P
      Choose the top-k paths ranked by their probabilities
      Conduct the corresponding transitions
end for
Prediction ← end entity of the top path     // Predict the end entity of the top path as the answer
Algorithm 2 Inference with beam search

0.a.3 Complexity Analysis

To analyze the complexity of our method, we list all the parameters contained in the building blocks. Moreover, we present the number of operations of a forward pass, i.e., the complete run that derives the answer from a given question and scene graph. They are listed in Table 3.

Group Name No. Parameters No. Operations
Word Embeddings* Entity
GAT Conv layer weight
Conv layer attention
Conv layer bias
Transformer Positional encoder
Layer self attention
Self attn norm
Layer enc attn
Enc attn norm
Pos ffn 1
Pos ffn 2
Pos ffn norm
Enc attn norm
Agent-MLP Dense 0
Dense 1
Agent-LSTM Lstm_cell
Table 3: An overview of the number of parameters and the asymptotic number of operations for the individual modules. The batch size is indicated by B, T corresponds to the number of time steps, and d and h denote the embedding size and hidden size, respectively. Blocks are marked with a “*” if their weights are not trainable.
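To illustrate how parameter counts of the kind listed in Table 3 arise, the following helpers compute the counts for standard layer types (embedding table, dense layer, LSTM cell). The function names and dimensions are illustrative; the paper's exact sizes are not reproduced here.

```python
def embedding_params(vocab_size, d):
    """Embedding table: one d-dimensional vector per word/entity symbol."""
    return vocab_size * d


def dense_params(d_in, d_out, bias=True):
    """Fully connected layer: a weight matrix plus an optional bias vector."""
    return d_in * d_out + (d_out if bias else 0)


def lstm_cell_params(d, h):
    """Standard LSTM cell: four gates, each with input weights (d*h),
    recurrent weights (h*h), and a bias vector (h)."""
    return 4 * (d * h + h * h + h)
```

For example, a dense layer mapping 300-dimensional GloVe embeddings to a 512-dimensional hidden space has `dense_params(300, 512)` = 154,112 trainable parameters.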

Appendix 0.B Additional Details on the Dataset GQA

In this section, we describe the various question categories and their types. We list the questions based on semantic and structural categories and further group them by their entity type, such as object, attribute, and category. Table 4 gives a detailed list of the question categories.

| Category   | Type      | Description                | Example                                                                                                   |
|------------|-----------|----------------------------|-----------------------------------------------------------------------------------------------------------|
| Semantic   | Object    | Existence of an object     | Are there any doors that are not made of metal?                                                           |
| Semantic   | Attribute | Property of an object      | Does the soap dispenser that is to the right of the other soap dispenser have small size and white color? |
| Semantic   | Category  | Identify an object class   | What kind of animal is standing?                                                                          |
| Semantic   | Relation  | Relationship of objects    | What is the food that is to the left of the white object that is to the left of the chocolate called?     |
| Semantic   | Global    | Overall scene property     | Which place is it?                                                                                        |
| Structural | Query     | Open-form question         | What type of furniture is to the left of the silver device which is to the left of the helmet?            |
| Structural | Choose    | Choose from alternatives   | What are the floating people in the ocean doing, riding or swimming?                                      |
| Structural | Verify    | Simple yes/no question     | Are there statues above the brass clock that is on the building?                                          |
| Structural | Compare   | Comparison of objects      | Are the drawers made of the same material as the cages?                                                   |
| Structural | Logical   | And/or operators           | Are both the giraffe near the building and the giraffe that is to the left of the tray standing?          |

Table 4: List of question examples in the GQA dataset.