Visual Question Answering (VQA) is a challenging task that involves understanding and reasoning over two data modalities, i.e., images and natural language. Given an image and a free-form question that formulates a query about the depicted scene, the task is for the algorithm to find the correct answer.
VQA has been studied from the perspective of scene and knowledge graphs [33, 6], as well as vision-language reasoning [10, 1]. To study VQA, various real-world datasets, such as the VQA dataset [4, 24], have been generated. It has been argued that, in the VQA dataset, many of the apparently challenging reasoning tasks can be solved by an algorithm through exploiting trivial prior knowledge, and thus by shortcuts to proper reasoning (e.g., clouds are white or doors are made of wood). To address these shortcomings, the GQA dataset has been developed. Compared to other real-world datasets, GQA is more suitable for evaluating reasoning abilities since the images and questions are carefully filtered to make the data less prone to biases.
Many VQA approaches are agnostic towards the explicit relational structure of the objects in the presented scene and rely on monolithic neural network architectures that process regional features of the image separately [2, 39]. While these methods led to promising results on previous datasets, they lack explicit compositional reasoning abilities, which results in weaker performance on more challenging datasets such as GQA. Other works [34, 31, 17] perform reasoning on explicitly detected objects and interactive semantic and spatial relationships among them. These approaches are closely related to scene graph representations of an image, where detected objects are labeled as nodes and relationships between the objects are labeled as edges. In this work, we aim to combine VQA techniques with recent research advances in the area of statistical relational learning on knowledge graphs (KGs). KGs provide human-understandable, structured representations of knowledge about the real world via collections of factual statements. Inspired by multi-hop reasoning methods on KGs such as [8, 38, 12], we propose Graphhopper, a novel method that models the VQA task as a path-finding problem on scene graphs. The underlying idea can be summarized with the phrase: learn to walk to the correct answer. More specifically, given an image, we consider a scene graph and train a reinforcement learning agent to conduct a policy-guided random walk on the scene graph until a conclusive inference path is obtained. In contrast to purely embedding-based approaches, our method provides explicit reasoning chains that lead to the derived answers. To sum up, our major contributions are as follows.
Graphhopper is the first VQA method that employs reinforcement learning for multi-hop reasoning on scene graphs.
We conduct a thorough experimental study on the challenging VQA dataset named GQA to show the compositional and interpretable nature of our model.
To analyze the reasoning capabilities of our method, we consider manually curated (ground truth) scene graphs. This setting isolates the noise associated with the visual perception task and focuses solely on the language understanding and reasoning task. Thereby, we can show that our method achieves human-like performance.
Based on both the manually curated scene graphs and our own automatically generated scene graphs, we show that Graphhopper outperforms the Neural State Machine (NSM), a state-of-the-art scene graph reasoning model that operates in a setting similar to Graphhopper.
Moreover, we are the first group to conduct experiments and publish the code on generated scene graphs for the GQA dataset (code is available at https://github.com/rajatkoner08/Graphhopper).

The remainder of this work is organized as follows. We review related literature in the next section. Section 3 introduces the notation and describes the methodology of Graphhopper. Section 4 and Section 5 detail an experimental study on the benchmark dataset GQA. Furthermore, through a rigorous study using both manually curated ground-truth and generated scene graphs, we examine the reasoning capabilities of Graphhopper. We conclude in Section 6.
2 Related Work
Visual Question Answering:
Currently, leading VQA approaches can be categorized into two different branches: first, monolithic neural networks, which perform implicit reasoning on latent representations obtained from fusing the two data modalities; second, multi-hop methods that form explicit symbolic reasoning chains on a structured representation of the data. Monolithic network architectures obtain visual features from the image either in the form of individual detected objects or by processing the whole image directly via convolutional neural networks (CNNs). The derived embeddings are usually scored against a fixed answer set along with the embedding of the question obtained from a sequence model. Moreover, co-attention mechanisms are frequently employed to couple the vision and the language models, allowing for interactions between objects from both modalities [20, 2, 5, 40, 41]. Monolithic networks are among the dominant methods on previous real-world VQA datasets such as the VQA dataset. However, they suffer from the black-box problem and possess limited reasoning capabilities with respect to complex questions that require long reasoning chains.
Explicit reasoning methods combine the sub-symbolic representation learning paradigm with symbolic reasoning approaches over structured representations of the image. Most of the popular explicit reasoning approaches follow the idea of neural module networks (NMNs), which perform a sequence of reasoning steps realized by forward passes through specialized neural networks that each correspond to predefined reasoning subtasks. Thereby, NMNs construct functional programs by dynamically assembling the modules, resulting in a question-specific neural network architecture. In contrast to the monolithic neural network architectures described above, these methods contain a natural transparency mechanism via functional programs. However, while NMN-related methods (e.g., [14, 26]) exhibit good performance on synthetic datasets such as CLEVR, they require functional module layouts as additional supervision signals to obtain good results. Closely related to our method is the Neural State Machine (NSM). NSM's underlying idea consists of first constructing a scene graph from an image and treating it as a state machine. Concretely, the nodes correspond to states and edges to transitions. Then, conditioned on the question, a sequence of instructions is derived that indicates how to traverse the scene graph and arrive at the answer. In contrast to NSM, we treat path-finding as a decision problem in a reinforcement learning setting. Concretely, we outline in the next section how extracting predictive paths from scene graphs can be naturally formulated in terms of a goal-oriented random walk induced by a stochastic policy that allows the approach to balance between exploration and exploitation. Moreover, our framework integrates state-of-the-art techniques from graph representation learning and NLP. This paper only considers basic policy gradient methods; more sophisticated reinforcement learning techniques will be employed in future work.
Statistical Relational Learning:
Machine learning methods for KG reasoning aim at exploiting statistical regularities in observed connectivity patterns. These methods are studied under the umbrella of statistical relational learning (SRL). In recent years, KG embeddings have become the dominant approach in SRL. The underlying idea is that graph features that explain the connectivity pattern of KGs can be encoded in low-dimensional vector spaces. In the embedding spaces, the interactions among the embeddings for entities and relations can be efficiently modeled to produce scores that predict the validity of a triple. Despite achieving good results in KG reasoning tasks, most embedding-based methods have problems capturing the compositionality expressed by long reasoning chains. This often limits their applicability in complex reasoning tasks. Recently, multi-hop reasoning methods such as MINERVA and DeepPath were proposed. Both methods are based on the idea that a reinforcement learning agent is trained to perform a policy-guided random walk until the answer entity to a query is reached. Thereby, the path-finding problem of the agent can be modeled in terms of a sequential decision making task framed as a Markov decision process (MDP). The method that we propose in this work follows a similar philosophy, in the sense that we train an RL agent to navigate on a scene graph to the correct answer node. However, a conceptual difference is that the agents in MINERVA and DeepPath perform walks on large-scale knowledge graphs exploiting repeating statistical patterns, so that the policies implicitly incorporate approximate rules. In addition, instead of processing free-form questions, the query in the KG reasoning setting is structured as a pair of symbolic entities. That is why we propose a wide range of modifications to adjust our method to the challenging VQA setting.
3 Methodology

The task of VQA is framed as a scene graph traversal problem. Starting from a hub node that is connected to all other nodes, an agent sequentially samples transitions to neighboring nodes on the scene graph until the node corresponding to the answer is reached. In this way, by adding transitions to the current path, the reasoning chain is successively extended. Before describing the decision problem of the agent, we introduce the notation that we use throughout this work.
A scene graph is a directed multigraph where each node corresponds to a scene entity, which is either an object associated with a bounding box or an attribute of an object. Each scene entity comes with a type that corresponds to the predicted object or attribute label. Typed edges specify how scene entities are related to each other. More formally, let $\mathcal{E}$ denote the set of scene entities and consider the set of binary relations $\mathcal{R}$. Then a scene graph $\mathcal{SG} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a collection of ordered triples $(s, p, o)$ – subject, predicate, and object. For example, as shown in Figure 1, the triple (motorcycle-1, has_part, tire-1) indicates that both a motorcycle (subject) and a tire (object) are detected in the image. The predicate has_part indicates the relation between the entities. Moreover, we denote with $p^{-1}$ the inverse relation corresponding to the predicate $p$. For the remainder of this work, we impose completeness with respect to inverse relations in the sense that for every $(s, p, o) \in \mathcal{SG}$ it is implied that $(o, p^{-1}, s) \in \mathcal{SG}$.
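A minimal sketch of this triple representation with inverse-relation completion; the names and the `_inv` suffix are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of a scene graph as a set of typed triples, with the
# inverse-relation completeness described above: for every triple (s, p, o)
# we also store (o, p^-1, s). The "_inv" suffix is a placeholder convention.
def add_with_inverse(triples, s, p, o):
    triples.add((s, p, o))
    triples.add((o, p + "_inv", s))  # inverse predicate p^-1
    return triples

sg = set()
add_with_inverse(sg, "motorcycle-1", "has_part", "tire-1")
```

With this convention, any edge can later be traversed in either direction by the agent.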
The state space of the agent is given by $\mathcal{E} \times \mathcal{Q}$, where $\mathcal{E}$ are the nodes of a scene graph and $\mathcal{Q}$ denotes the set of all questions. The state at time $t$ consists of the entity $e_t$ at which the agent is currently located and the question $q$. Thus, a state for time $t$ is represented by $S_t = (e_t, q)$. The set of available actions from a state $S_t$ is denoted by $\mathcal{A}_{S_t}$. It contains all outgoing edges from the node $e_t$ together with their corresponding object nodes. More formally, $\mathcal{A}_{S_t} = \{(r, e) \in \mathcal{R} \times \mathcal{E} : (e_t, r, e) \in \mathcal{SG}\}$. Moreover, we denote with $A_t \in \mathcal{A}_{S_t}$ the action that the agent performed at time $t$. We include self-loops for each node that produce a NO_OP label. These self-loops allow the agent to remain at its current location if it has reached the answer node. Furthermore, the introduction of inverse relations allows the agent to transition freely in either direction between two nodes.
The environment evolves deterministically by updating the state according to the previous action. Formally, the transition function at time $t$ is given by $\delta(S_t, A_t) := S_{t+1}$ with $S_t = (e_t, q)$ and $A_t = (r, e_{t+1})$.
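As an illustration, the deterministic environment can be sketched as follows; the graph contents, node names, and the adjacency-list layout are hypothetical:

```python
# Minimal sketch of the environment described above: a state is
# (current node, question); an action is an outgoing (edge, target) pair,
# including a NO_OP self-loop; transitions are deterministic.
def available_actions(graph, node):
    # NO_OP self-loop plus all outgoing edges with their target nodes
    return [("NO_OP", node)] + graph.get(node, [])

def step(state, action):
    _, question = state
    _, target = action
    return (target, question)  # deterministic state update

graph = {
    "hub": [("links_to", "motorcycle-1")],
    "motorcycle-1": [("has_part", "tire-1")],
}
state = ("hub", "What does the motorcycle have?")
state = step(state, ("links_to", "motorcycle-1"))
```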
Auxiliary Nodes: In addition to the standard entity and relation nodes present in a scene graph, we introduce a few auxiliary nodes (e.g., a hub node). The underlying rationale for the inclusion of auxiliary nodes is that they facilitate the walk for the agent or help to frame the QA task as a goal-oriented walk on the scene graph. These additional nodes are included during the graph traversal at run time, but they are ignored at compile time, such as when computing node embeddings. For example, we add a hub node (hub) to every scene graph which is connected to all other nodes. The agent then starts the scene graph traversal from the hub with its global connectivity. Furthermore, for a binary question, we add YES and NO nodes to the scene entities that correspond to the final location of the agent. The agent can then transition to either the YES or the NO node.
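A toy sketch of this augmentation, under the assumption that the hub is wired to every node with a generic edge label (the label `links_to` is our own placeholder):

```python
# Sketch of the auxiliary-node augmentation described above: a hub node
# connected to all entities, plus YES/NO nodes for binary questions.
# Edge labels and the wiring of YES/NO are illustrative assumptions.
def augment(graph_nodes, binary_question):
    edges = [("hub", "links_to", n) for n in graph_nodes]
    nodes = set(graph_nodes) | {"hub"}
    if binary_question:
        nodes |= {"YES", "NO"}  # terminal nodes for binary answers
    return nodes, edges

nodes, edges = augment(["motorcycle-1", "tire-1"], binary_question=True)
```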
Question and Scene Graph Processing
We initialize the words in $q$ with GloVe embeddings of dimension 300. Similarly, we initialize entities and relations in the scene graph with the embeddings of their type labels. The node embeddings are then passed through a multi-layered graph attention network (GAT). Extending the idea of graph convolutional networks with a self-attention mechanism, GATs mimic the convolution operator on regular grids, where an entity embedding is formed by aggregating node features from its neighbors. Relations and inverse relations between nodes allow context to flow in both directions through the GAT. Thus, the resulting embeddings are context-aware, which makes nodes with the same type but different graph neighborhoods distinguishable. To produce an embedding for the question $q$, we first apply a Transformer, followed by a mean pooling operation.
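The final pooling step can be illustrated in isolation; the toy vectors below stand in for the contextualized token embeddings produced by the Transformer:

```python
# Sketch of the mean pooling used to obtain a single question embedding
# from per-token vectors. The 2-dimensional toy vectors stand in for the
# 300-dimensional contextualized embeddings used in the paper.
def mean_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

q_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```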
Finally, since we added auxiliary YES and NO nodes to the scene graph for binary questions, we train a feedforward neural network to classify questions as either query-type (i.e., questions that query for an object in the depicted scene) or binary. This network consists of two fully connected layers with a ReLU activation on the intermediate output. We find that it is easy to distinguish between query and binary questions (e.g., query questions usually begin with What, Which, How, etc., whereas binary questions usually begin with Do, Is, etc.). Since our classifier achieves 99.99% accuracy, we will ignore the error in question classification in the following discussion.
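To illustrate why this classification is easy, a rule-based stand-in keyed on the first word already separates the two types well; the model actually used is the two-layer feedforward network described above, so this heuristic is purely expository:

```python
# Toy illustration of why query/binary classification is easy: the first
# word alone is highly indicative. This rule-based stand-in is only an
# assumption for exposition; the paper uses a learned classifier.
BINARY_STARTS = {"do", "does", "did", "is", "are", "was", "were"}

def question_type(question):
    first = question.split()[0].lower()
    return "binary" if first in BINARY_STARTS else "query"
```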
We denote the agent's history up to time $t$ with the tuple $H_t = (H_{t-1}, A_{t-1})$ for $t \geq 1$, along with $H_0 = \mathrm{hub}$ for $t = 0$. The history is encoded via a multilayered LSTM

$$\mathbf{h}_t = \mathrm{LSTM}\left(\mathbf{a}_{t-1}\right), \tag{1}$$

where $\mathbf{a}_{t-1} = \left[\mathbf{r}_{t-1}; \mathbf{e}_t\right]$ corresponds to the embedding of the previous action, with $\mathbf{r}_{t-1}$ and $\mathbf{e}_t$ denoting the embeddings of the edge and the target node, respectively. The history-dependent action distribution is given by

$$\mathbf{d}_t = \mathrm{softmax}\left(\mathbf{A}_t\left(\mathbf{W}_2\,\mathrm{ReLU}\left(\mathbf{W}_1\left[\mathbf{h}_t; \mathbf{q}\right]\right)\right)\right), \tag{2}$$

where the rows of $\mathbf{A}_t$ contain the latent representations of all admissible actions. Moreover, $\mathbf{q}$ encodes the question $q$. The action $A_t$ is drawn according to $\mathrm{Categorical}\left(\mathbf{d}_t\right)$. Equations (1) and (2) induce a stochastic policy $\pi_\theta$, where $\theta$ denotes the set of trainable parameters.
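The action-sampling step can be sketched as a softmax over action scores; the plain dot-product scoring below is a simplification of the learned, history-dependent scoring described above:

```python
# Sketch of a history-dependent action distribution: a softmax over
# scores of the admissible actions given a context vector. The dot-product
# scoring is a simplification of the LSTM + weight-matrix scoring in the
# text; all vectors are toy values.
import math

def action_distribution(action_reprs, context):
    scores = [sum(a * c for a, c in zip(rep, context)) for rep in action_reprs]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

dist = action_distribution([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```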
Rewards and Optimization
After sampling $T$ transitions, a terminal reward is assigned according to

$$R = \begin{cases} 1 & \text{if the agent is located on the correct answer node at time } T, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
We employ REINFORCE to maximize the expected rewards. Thus, the agent's maximization problem is given by

$$\underset{\theta}{\arg\max}\; \mathbb{E}_{q \sim \mathcal{Q}_{\mathrm{train}}}\, \mathbb{E}_{A_1, A_2, \ldots, A_T \sim \pi_\theta}\left[R \mid q\right], \tag{4}$$

where $\mathcal{Q}_{\mathrm{train}}$ denotes the set of training questions. During training, the first expectation in Equation (4) is substituted with the empirical average over the training set. The second expectation is approximated by the empirical average over multiple rollouts. We also employ a moving average baseline to reduce the variance. Further, we use entropy regularization with parameter $\lambda$ to enforce exploration. During inference, we do not sample paths but perform a beam search with width 20 based on the transition probabilities given by Equation (2).
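A sketch of the resulting loss computation with a baseline and entropy regularization; the rollout values and the entropy coefficient here are illustrative:

```python
# Sketch of a REINFORCE loss with a baseline and entropy regularization,
# as described above: maximize E[R], i.e. minimize
# -(R - b) * sum(log pi) - lambda * H, averaged over rollouts.
# Rollout numbers and lam=0.2 are illustrative values.
import math

def reinforce_loss(log_probs, entropies, rewards, baseline, lam=0.2):
    losses = []
    for lp, ent, r in zip(log_probs, entropies, rewards):
        advantage = r - baseline
        losses.append(-advantage * sum(lp) - lam * sum(ent))
    return sum(losses) / len(losses)

# two rollouts of length 4: one rewarded, one not; baseline = mean reward
loss = reinforce_loss(
    log_probs=[[math.log(0.5)] * 4, [math.log(0.5)] * 4],
    entropies=[[0.6] * 4, [0.6] * 4],
    rewards=[1.0, 0.0],
    baseline=0.5,
)
```

With a symmetric pair of rollouts like this, the advantage terms cancel and only the entropy bonus remains.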
Additional details on the model, the training and the inference procedure along with sketches of the algorithms, and a complexity analysis can be found in the supplementary material.
4 Dataset and Experimental Setup
In this section we introduce the dataset and detail the experimental protocol.
4.1 The GQA Dataset

The GQA dataset has been introduced with the goal of addressing key shortcomings of previous VQA datasets, such as CLEVR or the VQA dataset. GQA is more suitable for evaluating the reasoning and compositional abilities of a model in a realistic setting. It contains 113K images and around 1.2M questions, split among training, validation, and testing sets. The overall vocabulary consists of 3097 words, including 1702 object classes, 310 relationships, and 610 object attributes.
Due to the large number of objects and relationships present in GQA, we used a pruned version of the dataset (see Section 5) for our generated scene graphs. In this work, we have conducted two primary experiments. First, we report the results on the manually curated scene graphs provided in the GQA dataset. In this setting, the true reasoning and language understanding capabilities of our model can be analyzed. Afterward, we evaluate the performance of our model with the generated scene graphs on the pruned GQA dataset, which shows the performance of our model on noisy generated data. We have used the state-of-the-art Relation Transformer Network (RTN) for scene graph generation and DetectoRS for object detection. We have conducted all experiments on the “test-dev” split of GQA.
The questions are designed to evaluate the reasoning abilities such as visual verification, relational reasoning, spatial reasoning, comparison, and logical reasoning. These questions can be categorized either according to structural or semantic criteria. An overview of the different question types is given in supplementary (see Table 4).
4.2 Experimental Setup
Scene Graph Reasoning:
Regarding the model parameters, we apply 300-dimensional GloVe embeddings to both the questions and the graphs (i.e., edges and nodes). Moreover, we employ a two-layer GAT model. The dropout probability of each layer is set to 0.1. The first layer has eight attention heads. Each head has eight latent features, which are concatenated to form the output features of that layer. The output layer has eight attention heads with mean aggregation, so that the output also has 300-dimensional features. We apply dropout to the attention coefficients at each layer. This essentially means that each node is exposed to a stochastically sampled neighborhood during training. Moreover, we employ a two-layer Transformer decoder model. The model dimension is set to 300, and the key and query dimensions are both set to 64, with dropout applied. The LSTM of the policy network consists of a uni-directional layer with hidden size 300. Finally, the agent performs a fixed number of transitions. In question answering, most questions concern one subject to be explored within one reasoning path originating from the start node. Hence, we set the maximum number of steps to 4 for query questions, without resetting. By contrast, binary questions have 8 steps and a reset frequency of 4; in other words, the agent is prompted back to the hub node after the fourth step.
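The step schedule for the two question types can be sketched as follows (node names are placeholders):

```python
# Sketch of the step schedule described above: query questions get 4 steps
# with no reset; binary questions get 8 steps with a reset to the hub node
# after the fourth step. The walk itself is a placeholder list of nodes.
def positions_after_walk(binary, walk):
    max_steps = 8 if binary else 4
    trace = []
    for t, nxt in enumerate(walk[:max_steps], start=1):
        pos = nxt
        if binary and t == 4:
            pos = "hub"  # reset to the hub after the fourth step
        trace.append(pos)
    return trace

trace = positions_after_walk(True, ["a", "b", "c", "d", "e", "f", "g", "h"])
```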
Training the Graphhopper:
In terms of the training procedure, the GAT, the Transformer, and the policy networks are initialized with Glorot initialization. We train our model with data from the val_balanced_questions tier. We use a batch size of 64 and sample a batch of questions along with their associated graphs. We collect 20 stochastic rollouts for each question, performed in a vectorized form to exploit parallel computation. For each batch, we collect the rewards once a complete forward pass is done. Then the gradients are approximated from the rewards and applied to update the weights. We employ the Adam optimizer for all trainable weights. The coefficient for the action entropy, which balances exploration and exploitation, starts at 0.2 and decays exponentially by a factor of 0.99 at each step.
4.3 Performance Metrics
Along with the accuracy (i.e., Hits@1) on open questions (“Open”), binary questions (yes/no) (“Binary”), and the overall accuracy (“Accuracy”), we also report the additional metrics “Consistency” (answers should not contradict themselves), “Validity” (answers are in the range of the question; e.g., red is a valid answer when asked for the color of an object), and “Plausibility” (answers should be reasonable; e.g., red is a plausible color for an apple, blue is not).
5 Results and Discussion
As outlined before, VQA is a challenging task, and there is still a significant performance gap between state-of-the-art VQA methods and human performance on challenging, real-world datasets such as GQA. Similar to other existing methods, our architecture involves multiple components, and it is important to be able to analyze the performance of the different modules and processing steps in isolation. Therefore, we first present the results of our experiments on manually curated, ground-truth scene graphs provided in the GQA dataset and compare the performance of Graphhopper against NSM and humans. This setting allows us to isolate the noise from the visual perception component and quantify our method's reasoning capabilities. Subsequently, we present the results with our own generated scene graphs.
In addition, we observed that the inclusion of auxiliary nodes helps the agent to perform efficiently. Starting from the hub node performs better than starting from a random node, as the hub's global connectivity facilitates easier forward movement and backtracking. For binary questions, instead of using the YES and NO nodes, we also experimented with processing the path of the agent by a separate classifier (e.g., a logistic regression) whose classification logits were assigned as rewards. However, this led to inferior results, most likely due to the absence of a weight-sharing mechanism and the noisy reward signal produced by the classifier. These observations support our assumptions on the role of the auxiliary nodes in the scene graph.
NSM is the state-of-the-art method conceptually most similar to ours, as it also exploits scene graph reasoning for VQA, and we consider it our baseline for comparison. However, its approach to reasoning differs from ours. To compare the reasoning ability of our method on the same generated scene graphs, we reproduced NSM, as its code is not open-sourced, using the available parameters and an existing implementation.
5.1 Results on Manually Curated Scene Graphs
In this section, we report on an experimental study with Graphhopper on the manually curated scene graphs provided along with the GQA dataset. Table 1 shows the performance of Graphhopper and compares it with the reported human performance and with the performance of NSM on the same underlying manually curated scene graphs. We find that Graphhopper strictly outperforms NSM with respect to all performance measures. In particular, on the open questions, the performance gap is significant. Moreover, Graphhopper also slightly outperforms humans with respect to the accuracy on both types of questions. On the other hand, concerning the supplementary performance measures consistency, validity, and plausibility, Graphhopper is outperformed by humans but nevertheless consistently reaches high values. Overall, these results can be seen as a testament to the reasoning capabilities of Graphhopper and establish an upper bound on its performance.
5.2 Results on Automatically Generated Scene Graphs
Generating a graph representation of visual data is a costly and complex procedure. Although scene graph generation is not the main focus of this work, creating good scene graphs for GQA constituted one of the major challenges, due to the following facts:
There is no open source code for GQA scene graph generation or object detection.
A large number of instances and an uneven class distribution in GQA lead to a significant drop in accuracy compared to existing scene graph datasets.
There is a lack of attribute prediction models in modern object detection frameworks.
In this work, we address all of these challenges, as our model's performance directly depends on the quality of the scene graph. We will also open-source our code base for transparency and to accelerate the development of scene graph-based reasoning for VQA.
Generation of Scene Graph:
To address these problems, we first choose two state-of-the-art networks: RTN for scene graph generation and DetectoRS for object detection. The Transformer-based architecture of RTN and its contextual scene graph embeddings are closely related to our architecture and suitable for future extensions. To keep Graphhopper generic with respect to the scene graph generator, we do not use the contextualized embeddings from RTN; instead, we rely on the GAT for contextualization.
Pruning of GQA:
GQA has more than 6 times the number of relationships of Visual Genome, the most used scene graph generation dataset, and more than 18 times the number of objects of COCO, the most common object detection dataset. Also, the class distribution is highly skewed, which causes a significant drop in accuracy for both the object detection and the scene graph generation task. To prune the number of instances efficiently, we take the first 800 classes, 170 relationships, and 200 attributes based on their frequency of occurrence in the training questions and answers. This pruning allows us to remove a large share of the vocabulary while still covering the vast majority of the combined answers in the training set.
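The frequency-based pruning can be sketched with a simple counter (toy token counts; the actual cutoffs are 800 classes, 170 relationships, and 200 attributes):

```python
# Sketch of frequency-based pruning: keep the k most frequent labels as
# counted over the training questions and answers. The token list and the
# cutoff k=2 are toy values for illustration.
from collections import Counter

def top_k(tokens, k):
    return [w for w, _ in Counter(tokens).most_common(k)]

kept_classes = top_k(["car", "car", "tree", "car", "tree", "dog"], k=2)
```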
One of the shortcomings of existing scene graph generation and object detection networks is that they do not predict the attributes (e.g., the color or size) of a detected object. Therefore, we have incorporated attribute prediction for answering questions on GQA. The contextualized object embedding from RTN is used for attribute prediction as

$$\mathbf{p}_{\mathrm{obj}} = \mathrm{softmax}\left(\mathbf{W}_{o}\,\mathbf{c}\right), \qquad \mathbf{p}_{\mathrm{attr}} = \sigma\left(\mathbf{W}_{a}\,\mathbf{c}\right),$$

where $\mathbf{W}_{o}$ and $\mathbf{W}_{a}$ are the weight matrices of linear layers, $\mathbf{c}$ is the contextual embedding of an object, $\mathbf{p}_{\mathrm{obj}}$ is the probability distribution over all objects, and $\mathbf{p}_{\mathrm{attr}}$ is the probability distribution over the attributes. $\sigma$ denotes the sigmoid function.
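A sketch of these two prediction heads with toy weights, assuming only their functional form: a softmax head over object classes and independent sigmoid probabilities over attributes.

```python
# Sketch of the two attribute-prediction heads described above: a softmax
# head over object classes and a sigmoid head over attributes, both linear
# maps of the contextual object embedding c. All weights are toy values.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def heads(c, W_obj, W_attr):
    obj_logits = [sum(w * x for w, x in zip(row, c)) for row in W_obj]
    attr_logits = [sum(w * x for w, x in zip(row, c)) for row in W_attr]
    return softmax(obj_logits), [sigmoid(l) for l in attr_logits]

p_obj, p_attr = heads([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0]])
```

Note that the class probabilities sum to one, whereas each attribute probability is independent, so an object can carry several attributes at once.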
We have trained both the object detector and the scene graph generator on a pruned version of GQA with their respective default parameters after the preprocessing. This helps to increase the coverage of all instances (e.g., objects, attributes, relationships) in the training questions, implying that our generated scene graphs now cover the vast majority of the instances that represent answers to the training questions.
Table 2 shows the performance of Graphhopper in two settings: first, with a generated graph where we predict the classes, the attributes, and the relationships using our own pipeline; second, where we only use the predicted relationships from RTN (with ground-truth objects and attributes). We find that Graphhopper consistently outperforms NSM based on the generated graph. Moreover, in the “pr” (predicted relations) setting, it achieves an even higher score, as the graphs do not contain any mispredictions from the object detector. These encouraging results show superior reasoning abilities both on the fully generated graph and on the generated relationships between objects.
5.3 Discussion on the Reasoning Ability
To further analyze the reasoning abilities of Graphhopper, Figure 4 disentangles the results according to different types of questions: 5 semantic types (left) and 5 structural types (middle). Moreover, we report the performance of Graphhopper according to the length of the reasoning path (right) (see the supplementary material for additional information). We show the performance of Graphhopper separately for each of the three scene graph settings that we considered in this work: on the manually curated scene graphs, which depicts the actual performance in an ideal environment; based on only the predicted relationships between objects, which shows the performance of Graphhopper together with a scene graph generator; and based on the full pipeline of object detector, scene graph generator, and Graphhopper. First and foremost, we find that Graphhopper consistently achieves high accuracy on all types of questions in every setting. Moreover, we find that the performance of Graphhopper does not suffer when answering questions that require many reasoning steps. We conjecture that although high-complexity questions are harder to answer, proper contextualization of the embeddings (e.g., via the GAT and the Transformer) allows the agent to extract the specific information that identifies the correct target node. The good performance on these high-complexity questions can be seen as evidence that Graphhopper can efficiently translate the question into transitions on the scene graph, hopping until the correct answer is reached.
Examples of Reasoning Path:
Figure 3 shows three examples of scene graph traversals of Graphhopper that lead to the correct answer. One can see in these examples that the sequential reasoning process over explicit scene graph entities makes the reasoning process more comprehensible. In the case of wrong predictions, the extracted path may offer insights into the mechanics of Graphhopper and facilitate debugging.
6 Conclusion

We have proposed Graphhopper, a novel method for visual question answering that integrates existing KG reasoning, computer vision, and natural language processing techniques. Concretely, an agent is trained to extract conclusive reasoning paths from scene graphs. To analyze the reasoning abilities of our method, we conducted a rigorous experimental study on both manually curated and generated scene graphs. Based on the manually curated scene graphs, we showed that Graphhopper reaches human performance. Moreover, we found that, on our own automatically generated scene graphs, Graphhopper outperforms another state-of-the-art scene graph reasoning model with respect to all considered performance metrics. In future work, we plan to combine scene graphs with common sense knowledge graphs to further enhance the reasoning abilities of Graphhopper.
-  (2020) Counterfactual vision and language learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10044–10054. Cited by: §1.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §1, §2.
-  (2016) Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48. Cited by: §2.
-  (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §1, §2, §4.1.
-  (2019) Murel: multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1989–1998. Cited by: §2.
-  (2019) Counterfactual critic multi-agent training for scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4613–4623. Cited by: §1.
-  (2019) Meta module network for compositional visual reasoning. arXiv preprint arXiv:1910.03230. Cited by: §2.
-  (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: §1, §2.
-  (2019) NSM. GitHub. Note: https://github.com/charlespwd/project-title Cited by: §5.
-  (2020) Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: §1.
-  (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §4.2.
-  (2020) Reasoning on knowledge graphs with debate dynamics. arXiv preprint arXiv:2001.00461. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.
-  (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813. Cited by: §2.
-  (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §2.
-  (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. arXiv preprint arXiv:1902.09506. Cited by: §1, §2, §4.1, §4.3, §5.1, Table 1, §5.
-  (2019) Learning by abstraction: the neural state machine. In Advances in Neural Information Processing Systems, pp. 5901–5914. Cited by: §1, §5, §5.2, Table 1, Table 2.
-  (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §2, §2, §4.1.
-  (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1.
-  (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574. Cited by: §2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §0.A.1, §4.2.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.
-  (2020) Relation transformer network. arXiv preprint arXiv:2004.06193. Cited by: §4.1, Figure 4, §5.2, §5.2, §5.2, Table 2.
-  (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, 2nd item, §5.2.
-  (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §5.2.
-  (2019) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584. Cited by: §2.
-  (2015) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: §2.
-  (2017) Automatic differentiation in PyTorch. Cited by: §4.2.
-  (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §3.
-  (2020) DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv preprint arXiv:2006.02334. Cited by: §4.1, Figure 4, §5.2.
-  (2019) Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376–8384. Cited by: §1.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.2.
-  (2020) Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3716–3725. Cited by: §1.
-  (2017) Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §1.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3, §4.2, §5.2.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3, §4.2.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §0.A.1, §3.
-  (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. Cited by: §1, §2.
-  (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §1.
-  (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 1821–1830. Cited by: §2.
-  (2017) Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1291–1300. Cited by: §2.
Appendix 0.A Details on Model Training and Inference
0.a.1 Training Details
where γ is the discount factor for the reward. The gradients of the weights are aggregated over multiple rollouts. To reduce the variance, we adopt a moving-average baseline function b, which approximates the value of a state. We could have employed more sophisticated methods, such as an advantage network or an actor-critic algorithm; however, we find that the current baseline works sufficiently well. Formally, the baseline consists of a non-trainable variable b and a hyperparameter λ. At each optimization step, the baseline is updated via b ← λb + (1 − λ)R̄, where R̄ denotes the average accumulated reward over the current batch of rollouts. Another technique that affects the training speed is reward normalization. Concretely, the accumulated rewards at each time step for each rollout are collected and normalized after subtraction of the baseline value.
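The baseline update and reward normalization described above can be sketched as follows. This is a minimal NumPy illustration, not the exact implementation: the smoothing weight `lam` and the batch-level normalization are our assumptions.

```python
import numpy as np

def update_baseline(baseline, mean_reward, lam=0.95):
    """Moving-average baseline: b <- lam * b + (1 - lam) * mean_reward.
    `lam` is a hypothetical smoothing hyperparameter."""
    return lam * baseline + (1.0 - lam) * mean_reward

def normalize_advantages(rewards, baseline, eps=1e-8):
    """Subtract the baseline from the accumulated rewards of each rollout,
    then normalize to zero mean and unit variance across the batch."""
    adv = np.asarray(rewards, dtype=np.float64) - baseline
    return (adv - adv.mean()) / (adv.std() + eps)
```

The normalized advantages then weight the REINFORCE gradient of each rollout in place of the raw rewards.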
We introduce a regularization term on the entropy of the probability distribution produced by the policy network π_θ, which encourages the agent to explore the SG. The regularization is controlled by a hyperparameter β. In addition, we apply exponential decay to β during training so that β converges to zero.
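A minimal sketch of the entropy regularizer and its decay schedule follows; the parameterization `beta0 * decay_rate ** step` is an assumption for illustration, not the paper's exact schedule.

```python
import numpy as np

def entropy_bonus(probs, eps=1e-12):
    """Entropy of the policy's action distribution over the permissible
    transitions; a larger bonus rewards exploratory, less peaked policies."""
    p = np.asarray(probs, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))

def decayed_beta(beta0, decay_rate, step):
    """Exponentially decayed entropy weight beta_t = beta0 * decay_rate**step,
    so the exploration pressure vanishes as training progresses."""
    return beta0 * decay_rate ** step
```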
Beam search is used to infer the answer to a given question. Our inference approach is based on evaluating how likely specific paths are among all possible paths of a fixed length. More specifically, given an input question, the agent's initial location is the hub node. At each time step, the agent scores the next permissible actions based on the learned policy. The value of an action represents the transition probability from the current node to a target node. Next, we keep the top k (also known as the beam width) paths among all possible transitions and move the agent to the corresponding targets. This computation is performed iteratively until the maximum number of transitions is reached. In the end, we obtain multiple rollouts ranked by their path probabilities. The target node (i.e., the last node) of a path is regarded as an answer candidate. Unlike Monte Carlo sampling, which does not consider path probabilities, beam search yields better answer candidates, as it always chooses the best transitions within the search region. The inference algorithm is summarized in Table 2.
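The beam-search procedure above can be sketched as follows; this is a toy illustration in which the `policy` callable stands in for the learned policy network and is an assumption.

```python
import math

def beam_search(policy, start, max_steps, beam_width):
    """Policy-guided beam search on a scene graph.

    policy: function mapping a path (list of nodes) to a dict
            {neighbour: transition probability} for the path's last node.
    Returns the surviving paths with their accumulated log-probabilities,
    ranked best-first; the last node of the top path is the answer candidate.
    """
    beams = [([start], 0.0)]  # (path, accumulated log-probability)
    for _ in range(max_steps):
        candidates = []
        for path, logp in beams:
            for nxt, p in policy(path).items():
                candidates.append((path + [nxt], logp + math.log(p)))
        if not candidates:  # no outgoing transitions anywhere
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep the top-k paths
    return beams
```

For example, on a toy scene graph where the hub transitions to "cat" with probability 0.8 and "dog" with probability 0.2, the top-ranked path follows the higher-probability branch and its last node is returned as the answer candidate.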
The inference of our method is computationally efficient. Unlike other methods that need to iterate through every candidate answer for a final prediction, we only need to run the inference once to obtain the score of each answer. Let d denote the embedding dimension of the words and entities. Analytically, the embedding stage has asymptotic complexity linear in the number of embedded tokens and entities. For the GAT, the implementation of a single attention head and of multi-head attention is similar; in particular, they have the same asymptotic time complexity, which is linear in the number of nodes and edges of the SG. The computation of the question encoding is efficient, as it only runs once per question and is reused an arbitrary number of times during the random walks; moreover, the questions are usually short (fewer than 30 words). Finally, during the random walk sampling, the agent's per-step complexity is dominated by scoring the permissible actions. The inference time depends largely on the path length.
0.a.3 Complexity Analysis
To analyze the complexity of our method, we list all the parameters contained in the building blocks. Moreover, we present the number of operations of a forward pass, i.e., the complete run that derives the answer from a given question and scene graph. They are listed in Table 3.
| Group | Name | No. Parameters | No. Operations |
| --- | --- | --- | --- |
| GAT | Conv layer weight | | |
| GAT | Conv layer attention | | |
| GAT | Conv layer bias | | |
| Transformer | Layer self attention | | |
| Transformer | Self attn norm | | |
| Transformer | Layer enc attn | | |
| Transformer | Enc attn norm | | |
| Transformer | Pos ffn 1 | | |
| Transformer | Pos ffn 2 | | |
| Transformer | Pos ffn norm | | |
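As an illustration of how the GAT rows of the table arise, the parameter count of a single attention head can be computed as follows. The dimensions in the usage below are hypothetical; the counts follow the original GAT formulation, not necessarily this paper's exact configuration.

```python
def gat_head_params(in_dim, out_dim, bias=True):
    """Parameter count of one GAT attention head: a weight matrix of shape
    (in_dim, out_dim), an attention vector of length 2 * out_dim, and
    optionally a bias vector of length out_dim (these correspond to the
    'Conv layer weight / attention / bias' rows above)."""
    return in_dim * out_dim + 2 * out_dim + (out_dim if bias else 0)
```

For instance, a head mapping 8-dimensional inputs to 4-dimensional outputs has 8·4 + 2·4 + 4 = 44 parameters.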
Appendix 0.B Additional Details on the Dataset GQA
In this section, we describe the various question categories and their types. We list the questions based on their semantic and structural categories and further group them based on their entity type, e.g., object, attribute, or category. Table 4 gives the detailed list of question categories.
| Group | Type | Description | Example |
| --- | --- | --- | --- |
| Semantics | Object | Existence of an object | Are there any doors that are not made of metal? |
| Semantics | Attribute | Property of an object | Does the soap dispenser that is to the right of the other soap dispenser have small size and white color? |
| Semantics | Category | Identify an object class | What kind of animal is standing? |
| Semantics | Relation | Relationship of objects | What is the food that is to the left of the white object that is to the left of the chocolate called? |
| Semantics | Global | Overall scene property | Which place is it? |
| Structural | Query | Open-form question | What type of furniture is to the left of the silver device which is to the left of the helmet? |
| Structural | Choose | Choose from alternatives | What are the floating people in the ocean doing, riding or swimming? |
| Structural | Verify | Simple yes/no question | Are there statues above the brass clock that is on the building? |
| Structural | Compare | Comparison of objects | Are the drawers made of the same material as the cages? |
| Structural | Logical | And/or operators | Are both the giraffe near the building and the giraffe that is to the left of the tray standing? |