Grounded language comprehension tasks, such as visual question answering (VQA) or referring expression comprehension (REF), require finding the relevant objects in the scene and reasoning about certain relationships between them. For example, in Figure 1, to answer the question "is there a person to the left of the woman holding a blue umbrella", we must locate the relevant objects (person, woman and blue umbrella) and model the specified relationships (to the left of and holding).
How should we build a model to perform reasoning in grounded language comprehension tasks? Prior works have explored various approaches from learning joint visual-textual representations ([8, 29]) to pooling over pairwise relationships ([33, 41]) or constructing explicit reasoning steps with modular or symbolic representations ([2, 40]). Although these models are capable of performing complex relational inference, their scene representations are built upon local visual appearance features that do not contain much contextual information. Instead, they tend to rely heavily on manually designed inference structures or modules to perform reasoning about relationships, and are often specific to a particular task.
In this work, we propose an alternative way to facilitate reasoning with a context-aware scene representation, suitable for multiple tasks. Our proposed Language-Conditioned Graph Network (LCGN) model augments the local appearance feature of each entity in the scene with a relational contextualized feature. Our model is a graph network built upon visual entities in the scene, which collects relational information through multiple iterations of message passing between the entities. It dynamically determines which objects to collect information from on each round, by weighting the edges in the graph, and sends messages through the graph to propagate just the right amount of relational information. The key idea is to condition the message passing on the specific contextual relationships described in the input text. Figure 1 illustrates this process, where the person would be represented not only by her local appearance, but also by contextualized features indicating her relationship to other relevant objects in the scene, e.g., left of a woman. Our contextualized representation can be easily plugged into task-specific models to replace standard local appearance features, facilitating reasoning with rich relational information. For example, for the question answering task, it is sufficient to perform a single attention hop over the relevant object, whose representation is contextualized (e.g. blue box in Figure 1).
Importantly, our scene representation is constructed with respect to the given reasoning task. An object in the scene may be involved in multiple relations in different contexts: in Figure 1, the person can be simultaneously left of a woman holding a blue umbrella, holding a white bag, and standing on a sidewalk. Rather than building a complete representation of all the first- and higher-order relational information for each object (which can be enormous and unnecessary), we focus the contextual representation on relational information that is helpful to the reasoning task by conditioning on the input text (Figure 1 left vs. right).
We apply our Language-Conditioned Graph Networks to two reasoning tasks with language inputs: Visual Question Answering (VQA) and Referring Expression Comprehension (REF). In these tasks, we replace the local appearance-based visual representations with the context-aware representations from our LCGN model, and demonstrate that our context-aware scene representations can be used as inputs to perform complex reasoning via simple task-specific approaches, with a consistent improvement over the local appearance features across different tasks and datasets. We obtain state-of-the-art results on the GQA dataset for VQA and the CLEVR-Ref+ dataset for REF.
2 Related work
We first provide an overview of the reasoning tasks addressed in this paper. Then we review related work on graph networks and other contextualized representations. Finally, we discuss alternative approaches to reasoning problems.
Visual question answering (VQA) and referring expression comprehension (REF)
VQA and REF are two popular tasks that require reasoning about image content. While in VQA the goal is to answer a question about an image, in REF one has to localize an image region that corresponds to a referring expression. While the real-world VQA dataset [3, 10] focuses more on perception than complex reasoning, the more recent synthetic CLEVR dataset is a standard benchmark for relational reasoning. The even more recent GQA dataset brings together the best of both worlds: real images and relational questions. It is built upon the Visual Genome dataset and constructs balanced question-answer pairs from scene graphs.
For REF, there are a number of standard benchmarks such as RefCOCO and RefCOCOg, with natural language referring expressions and images from the COCO dataset. However, many of the expressions in these datasets do not require resolving relations. Recently, a new CLEVR-Ref+ dataset has been proposed for REF. It is built using the CLEVR environment and involves very complex queries, aiming to assess the reasoning capabilities of existing models and find their limitations.
In this work we tackle both VQA and REF tasks on three datasets in total. Notably, in all cases, we use the same approach, Language-Conditioned Graph Network (LCGN), to build contextualized representations of objects/image regions. This shows the generality and effectiveness of our approach for various visual reasoning tasks.
Graph networks and contextualized representations
Graph networks are powerful models that can perform relational inference through message passing [4, 9, 20, 22, 36, 44]. The core idea is to enable communication between image regions to build contextualized representations of these regions. Graph networks have been successfully applied to various tasks, from object detection and region classification to human-object interaction and activity recognition. Besides, self-attention models and non-local networks can also be cast as graph networks in a general sense. Below we review some of the recent works that rely on graph networks and other contextualized representations for VQA and REF.
A prominent work that introduced relational reasoning in VQA is , which proposes Relation Networks (RNs) for modeling relations between all pairs of objects, conditioned on a question.  extends RNs with the Broadcasting Convolutional Network module, which globally broadcasts objects’ visuo-spatial features. The first work to use graph networks in VQA is , which combines dependency parses of questions and scene graph representations of abstract scenes.  proposes modeling structured visual attention over a Conditional Random Field on image regions. A recent work, , conditions on a question to learn a graph representation of an image, capturing object interactions with the relevant neighbours via spatial graph convolutions. Later,  extends this idea to modeling spatial-semantic pairwise relations between all pairs of regions.
For the REF task,  proposes Language-guided Graph Attention Networks, where attention over nodes and edges is guided by a referring expression, which is decomposed into subject, intra-class and inter-class relationships.
Our work is related to, yet distinct from, the approaches above. While  predicts a sparsely connected graph (conditioned on the question) that remains fixed for each step of graph convolution, our LCGN model predicts dynamic edge weights to focus on different connections in each message passing iteration. Besides,  is tailored to VQA and is non-trivial to adapt to REF (since it includes max-pooling over node representations). Compared to , instead of max-pooling over explicitly constructed pairwise vectors, our model predicts normalized edge weights that both improve computation efficiency in message passing and make it easier to visualize and inspect connections. Finally, the Language-guided Graph Attention Networks approach is tailored to REF by modeling specific subject attention and intra- and inter-class relations, and does not gather higher-order relational information in an iterative manner. We propose a more general approach for scene representation that is applicable to both VQA and REF.
A multitude of approaches have been recently proposed to tackle visual reasoning tasks, such as VQA and REF. Neural Module Networks (NMNs) [2, 14] are interpretable multi-step models that build question-specific layouts and execute them against an image. NMNs have also been applied to REF, e.g. the Compositional Modular Networks and Stack-NMN. (The latter is a multi-task approach to VQA and REF.) An alternative approach, Memory, Attention, and Composition (MAC), also performs multi-step reasoning while recording information in its memory. FiLM is an approach which modulates image representation with the given question via conditional batch normalization, and is extended in  with Cascaded Mutual Modulation, a multi-step reasoning procedure where both modalities can modulate each other. The Neural-Symbolic approach disentangles reasoning from image and language understanding, by first extracting symbolic representations from images and text, and then executing symbolic programs over them. MAttNet, a state-of-the-art approach to REF, is conceptually related to NMNs as it uses attention to parse an expression and ground it through subject, location and relation modules.
Our approach is not meant to substitute the aforementioned reasoning models, but to complement them. Our contextualized visual representation can be combined with other reasoning models to replace the local feature representation. A prominent reasoning model capable of addressing both VQA and REF is Stack-NMN , and we empirically compare to it in Section 4.
3 Language-Conditioned Graph Networks
Given a visual scene and a textual input for a reasoning task such as VQA or REF, we propose to construct a contextualized representation for each entity in the scene that contains the relational information needed for the reasoning procedure specified in the language input.
This contextualized representation is obtained in our novel Language-Conditioned Graph Networks (LCGN) model, through iterative message passing conditioned on the language input. It can be then used as input to a task-specific output module such as a single-hop VQA classifier.
3.1 Context-aware scene representation
For an image $I$ and a textual input $Q$ that represents a reasoning task, let $N$ be the number of entities in the scene, where each entity can be a detected object or a spatial location on the convolutional feature map of the image. Let $x_i^{loc}$ (where $i = 1, \dots, N$) be the local feature representation of the $i$-th entity, i.e. the $i$-th detected object’s bounding box features or the convolutional features at the $i$-th location on the feature grid. We would like to output a context-aware representation $x_i^{out}$ for each entity $i$, conditioned on the textual input $Q$, that contains the relational context associated with entity $i$. This is obtained through iterative message passing over $T$ iterations with our Language-Conditioned Graph Networks, as shown in Figure 2.
We use a fully-connected graph over the scene, where each node corresponds to an entity $i$ as defined above, and there is a directed edge $i \to j$ between every pair of entities $i$ and $j$. Each node $i$ is represented by a local feature $x_i^{loc}$ that is fixed during message passing, and a context feature $x_{i,t}^{ctx}$ that is updated during each iteration $t$. A learned parameter is used as the initial context representation $x_{i,0}^{ctx}$ at $t = 0$ for all nodes, before the message passing starts.
Textual command extraction
To incorporate the textual input in the iterative message passing, we build a textual command vector $c_t$ for each iteration $t$ (where $t = 1, \dots, T$). Given a textual input $Q$ for the reasoning task, such as a question in VQA or a query in REF, we extract a set of vectors $\{c_t\}$ from the text $Q$, using the same multi-step textual attention mechanism as in Stack-NMN and MAC. Specifically, $Q$ is encoded into a sequence $\{h_s\}_{s=1}^{S}$ and a summary vector $q$ with a bi-directional LSTM (Eqn. 1).
Here $S$ is the number of words in $Q$, and $h_s$ is the concatenation of the forward and backward hidden states for word $s$ from the bi-directional LSTM output. At each iteration $t$, a textual attention $\alpha_{t,s}$ is computed over the words (Eqn. 2), and the textual command $c_t$ is obtained from the textual attention (Eqn. 3).
In these equations, $\odot$ denotes element-wise multiplication. Each $c_t$ can be seen as a textual command supplied during the $t$-th iteration. Unlike all other parameters, which are shared across message passing iterations, one parameter of the textual attention is learned separately for each iteration $t$.
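The multi-step textual attention described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Bi-LSTM outputs `h` and `q` are assumed to be precomputed, and the parameter names (`W_q`, `W_t`, `w_att`) and the exact form of the attention logits are hypothetical choices made for this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def extract_commands(h, q, W_q, W_t, w_att):
    """Return one textual command vector per message passing iteration.

    h:     (S, d) word states from a Bi-LSTM (assumed precomputed)
    q:     (d_q,) summary vector of the whole text
    W_q:   (d, d_q) projection shared across iterations (hypothetical name)
    W_t:   (T, d, d) one projection per iteration (hypothetical name)
    w_att: (d,) maps each modulated word state to an attention logit
    """
    shared = np.maximum(W_q @ q, 0.0)       # ReLU of the projected summary
    commands = []
    for W in W_t:                           # iteration-specific parameters
        query = W @ shared                  # (d,)
        logits = (h * query) @ w_att        # element-wise product, then score: (S,)
        alpha = softmax(logits)             # attention over the S words
        commands.append(alpha @ h)          # attention-weighted sum of word states
    return np.stack(commands)               # (T, d)

# Toy usage with random stand-ins for the learned parameters.
rng = np.random.default_rng(0)
S, d, d_q, T = 6, 8, 16, 4
h = rng.normal(size=(S, d))
q = rng.normal(size=d_q)
commands = extract_commands(
    h, q,
    W_q=rng.normal(size=(d, d_q)),
    W_t=rng.normal(size=(T, d, d)),
    w_att=rng.normal(size=d),
)
print(commands.shape)  # (4, 8)
```

Because each command is a convex combination of the word states, different iterations can focus on different words of the same text while sharing all but the per-iteration projection.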
Language-conditioned message passing
At the $t$-th iteration, where $t = 1, \dots, T$, we first build a joint representation of each entity. Then, we compute the (directed) connection weights $w_{j,i}^{(t)}$ from every entity $j$ (the sender) to every entity $i$ (the receiver). Finally, each entity $j$ sends a message vector $m_{j,i}^{(t)}$ to each entity $i$, and each entity $i$ sums up all of its incoming messages to update its contextual representation from $x_{i,t-1}^{ctx}$ to $x_{i,t}^{ctx}$, as described below.
Step 1. We build a joint representation $\tilde{x}_{i,t}$ for each node $i$ by concatenating $x_i^{loc}$, $x_{i,t-1}^{ctx}$ and their element-wise product (after linear mapping) (Eqn. 4).
Step 2. We compute the directed connection weights $w_{j,i}^{(t)}$ from node $j$ (the sender) to node $i$ (the receiver), conditioning on the textual command $c_t$ at iteration $t$ (Eqn. 5). Here, the connection weights are normalized with a softmax function over $j$, so that the sender weights sum up to 1 for each receiver, i.e. $\sum_{j=1}^{N} w_{j,i}^{(t)} = 1$ for all $i = 1, \dots, N$.
Step 3. Each node $j$ sends a message $m_{j,i}^{(t)}$ to each node $i$, conditioning on the textual input and weighted by the connection weight $w_{j,i}^{(t)}$ (Eqn. 6). Then, each node $i$ sums up the incoming messages and updates its context representation (Eqn. 7).
A naive implementation would involve $N^2$ pairwise vectors $m_{j,i}^{(t)}$, which is inefficient for large $N$. We implement it more efficiently by building an $N$-row matrix $M$ containing the unweighted messages in Eqn. 6, which is left multiplied by the edge weight matrix $E$ (where $E_{ij} = w_{j,i}^{(t)}$) to obtain the sums in Eqn. 7 for all nodes in a single matrix multiplication. With this implementation, we can train our LCGN model efficiently with $N$ as large as 196 in our experiments.
We combine each entity’s local feature $x_i^{loc}$ and context feature $x_{i,T}^{ctx}$ (after $T$ iterations) as its final representation $x_i^{out}$ (Eqn. 8).
The $x_i^{out}$ can be used as input to subsequent task-specific modules such as VQA or REF models, instead of the original local representation $x_i^{loc}$.
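The three steps and the single-matrix-multiplication trick can be sketched in NumPy as below. This is a minimal sketch under stated assumptions rather than the authors' code: all weight names in `init_params` are hypothetical, the weights are random stand-ins for learned parameters, bias terms are omitted, and the exact way the textual command enters the edge weights and messages (here an element-wise product with a projected command) is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d_loc, d_ctx, d_cmd):
    """Random stand-ins for the learned linear maps (all names hypothetical)."""
    d_joint = d_loc + 2 * d_ctx
    shapes = {
        "W_loc": (d_loc, d_ctx), "W_ctx": (d_ctx, d_ctx),        # Step 1
        "W_recv": (d_joint, d_ctx), "W_send": (d_joint, d_ctx),  # Step 2
        "W_cmd_e": (d_cmd, d_ctx),
        "W_msg": (d_joint, d_ctx), "W_cmd_m": (d_cmd, d_ctx),    # Step 3
        "W_upd": (2 * d_ctx, d_ctx),
        "W_out": (d_loc + d_ctx, d_ctx),                         # final output
    }
    return {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}

def message_passing_step(x_loc, x_ctx, c_t, P):
    """One language-conditioned iteration over N nodes."""
    # Step 1: joint representation [local; context; product of linear maps].
    prod = (x_loc @ P["W_loc"]) * (x_ctx @ P["W_ctx"])
    x_joint = np.concatenate([x_loc, x_ctx, prod], axis=1)
    # Step 2: E[i, j] is the weight from sender j to receiver i,
    # normalized with a softmax over the senders j.
    recv = x_joint @ P["W_recv"]
    send = (x_joint @ P["W_send"]) * (c_t @ P["W_cmd_e"])
    E = softmax(recv @ send.T, axis=1)          # (N, N), rows sum to 1
    # Step 3: one unweighted message per sender, then a single matmul
    # produces the weighted sums for every receiver at once.
    M = (x_joint @ P["W_msg"]) * (c_t @ P["W_cmd_m"])   # (N, d_ctx)
    incoming = E @ M
    x_ctx_new = np.concatenate([x_ctx, incoming], axis=1) @ P["W_upd"]
    return x_ctx_new, E, M

def lcgn_forward(x_loc, x_ctx0, commands, P):
    """Run T iterations and combine local and context features."""
    x_ctx = np.tile(x_ctx0, (x_loc.shape[0], 1))  # same initial context for all nodes
    for c_t in commands:
        x_ctx, _, _ = message_passing_step(x_loc, x_ctx, c_t, P)
    return np.concatenate([x_loc, x_ctx], axis=1) @ P["W_out"]

rng = np.random.default_rng(0)
N, d_loc, d_ctx, d_cmd, T = 5, 6, 4, 3, 4
P = init_params(rng, d_loc, d_ctx, d_cmd)
x_loc = rng.normal(size=(N, d_loc))
x_ctx0 = rng.normal(size=d_ctx)
commands = rng.normal(size=(T, d_cmd))
x_out = lcgn_forward(x_loc, x_ctx0, commands, P)
print(x_out.shape)  # (5, 4)
```

The naive alternative would materialize an N x N x d tensor of weighted messages; the matrix product keeps memory at N x N for the weights plus N x d for the messages.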
3.2 Application to VQA and REF
To apply our LCGN model to language-based reasoning tasks such as Visual Question Answering (VQA) and Referring Expression Comprehension (REF), we build simple task-specific output modules based on the language input and the contextualized representation of each entity. Our LCGN model and the subsequent task-specific modules are jointly trained end-to-end.
A single-hop answer classifier for VQA
The VQA task requires outputting an answer for an input image $I$ and a question $Q$. We adopt the commonly used classification approach and build a single-hop attention model as a classifier to select one of the possible answers from the training set.
First, the question $Q$ is encoded into a vector $q$ with the Bi-LSTM in Eqn. 1. Then a single-hop attention $\beta_i$ is used over the objects to aggregate visual information (Eqn. 9), which is fused with $q$ to predict a score vector $y$ with one score per answer (Eqn. 10).
During training, a softmax classification loss is applied on the output scores for answer classification.
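A single-hop attention classifier of this kind can be sketched as follows. The sketch is illustrative only: the weight names (`W_q`, `w_att`, `W_h`, `W_y`), the ReLU fusion layer and the layer sizes are assumptions, and the entity features and question vector are random stand-ins.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def single_hop_classifier(x_out, q, W_q, w_att, W_h, W_y):
    """Attend once over the entities, then fuse with the question vector.

    x_out: (N, d) entity representations; q: (d_q,) question vector.
    Returns the answer scores (A,) and the attention over entities (N,).
    """
    beta = softmax((x_out * (q @ W_q)) @ w_att)    # single attention hop
    attended = beta @ x_out                        # (d,) aggregated visual feature
    hidden = np.maximum(np.concatenate([attended, q]) @ W_h, 0.0)
    return hidden @ W_y, beta

def softmax_loss(scores, answer_idx):
    """Cross-entropy of the softmax over answer scores."""
    z = scores - scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[answer_idx]

rng = np.random.default_rng(0)
N, d, d_q, d_h, A = 5, 4, 6, 8, 10
x_out = rng.normal(size=(N, d))
q = rng.normal(size=d_q)
scores, beta = single_hop_classifier(
    x_out, q,
    W_q=rng.normal(size=(d_q, d)),
    w_att=rng.normal(size=d),
    W_h=rng.normal(size=(d + d_q, d_h)),
    W_y=rng.normal(size=(d_h, A)),
)
loss = softmax_loss(scores, answer_idx=3)
print(scores.shape, beta.shape)  # (10,) (5,)
```

The classifier itself has no relational machinery, so any relational reasoning has to come from the entity representations it is given.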
GroundeR  for REF
The REF task requires outputting a target bounding box as the grounding result for an input referring expression $Q$. Here, we use a retrieval approach as in previous works and select one target entity from the $N$ candidate entities in the scene (either object detection results or spatial locations on a convolutional feature map). To select the target object $p$ from the $N$ candidates, we encode the expression $Q$ into a vector $q$ as in Eqn. 1 and build a model similar to the fully-supervised version of GroundeR to output a matching score $r_i$ for each entity $i$ (Eqn. 11), from which the target entity $p$ is selected (Eqn. 12). In the case of using spatial locations on a convolutional feature map, we further output a 4-dimensional vector $u$ to predict the bounding box offset from the feature grid location (Eqn. 13).
During training, we use a softmax loss over the scores $r_i$ among the $N$ candidates to select the target entity $p$, and an L2 loss over the box offset $u$ to refine the box location.
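The retrieval-style grounding and its two training losses can be sketched as below. This is a hedged illustration: the weight names (`W_q`, `w_r`, `W_u`) and the form of the matching score are assumptions, and regressing the offset from the ground-truth entity during training is a choice made for this sketch, not something stated in the text.

```python
import numpy as np

def matching_scores(x_out, q, P):
    """One matching score per candidate entity: (N,)."""
    return (x_out * (q @ P["W_q"])) @ P["w_r"]

def predict(x_out, q, P):
    """Pick the best-scoring entity and predict its 4-d box offset."""
    r = matching_scores(x_out, q, P)
    p = int(np.argmax(r))
    u = x_out[p] @ P["W_u"]          # (4,) offset from the selected grid location
    return p, u

def ref_loss(x_out, q, P, target_idx, target_offset):
    """Softmax loss over candidates plus L2 loss on the box offset."""
    r = matching_scores(x_out, q, P)
    z = r - r.max()
    log_probs = z - np.log(np.exp(z).sum())
    selection_loss = -log_probs[target_idx]
    # Assumption: the offset is regressed from the ground-truth entity.
    u = x_out[target_idx] @ P["W_u"]
    box_loss = np.sum((u - target_offset) ** 2)
    return selection_loss + box_loss

rng = np.random.default_rng(0)
N, d, d_q = 9, 4, 6
P = {
    "W_q": rng.normal(size=(d_q, d)),   # hypothetical parameter names
    "w_r": rng.normal(size=d),
    "W_u": rng.normal(size=(d, 4)),
}
x_out = rng.normal(size=(N, d))
q = rng.normal(size=d_q)
p, u = predict(x_out, q, P)
total = ref_loss(x_out, q, P, target_idx=2, target_offset=np.zeros(4))
print(p, u.shape)
```

When the candidates are detected objects rather than grid locations, the offset branch can simply be dropped and the selected object's own box returned.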
4 Experiments
We apply our LCGN model to two tasks (VQA and REF) for language-conditioned reasoning. For the VQA task, we evaluate on the GQA dataset and the CLEVR dataset, which both require resolving relations between objects. For the REF task, we evaluate on the CLEVR-Ref+ dataset. In particular, the CLEVR and CLEVR-Ref+ datasets contain many complicated questions or expressions with higher-order relations, such as "the ball on the left of the object behind a blue cylinder".
4.1 Visual Question Answering (VQA)
Evaluation on the GQA dataset
We first evaluate our LCGN model on the GQA dataset for visual question answering. The GQA dataset is a large-scale visual question answering dataset with real images from the Visual Genome dataset and balanced question-answer pairs. Each training and validation image is also associated with scene graph annotations describing the classes and attributes of the objects in the scene, and their pairwise relations. Along with the images and question-answer pairs, the GQA dataset provides two types of pre-extracted visual features for each image: convolutional grid features extracted from a ResNet-101 network trained on ImageNet, and object detection features (with a maximum of 100 detected objects per image) from a Faster R-CNN detector.
We apply our LCGN model together with the single-hop classifier (“single-hop + LCGN”) in Sec. 3.2 for answer prediction. Message passing in our LCGN model is run for a fixed number of rounds $T$, and training takes approximately 20 hours using a single Titan Xp GPU. As a comparison to the context-aware representation from our LCGN model, we also train the single-hop classifier with only the local features $x_i^{loc}$ in Eqn. 9 (“single-hop”).
|Method||val||test-dev||test|
|single-hop + LCGN (ours)||63.8%||55.6%||56.0%|
We first experiment with using the released object detection features in the GQA dataset as our local features $x_i^{loc}$, which have been shown to perform better than the convolutional grid features, and compare with previous works. (Footnote 1: We learned from the GQA dataset authors that its test-dev and test splits were collected differently from its train and val splits, with a noticeable domain shift causing a performance drop from val to test-dev and test.) We train on the train split and report results on three GQA splits (val, test-dev and test). The performance of previous work on val was obtained from the dataset authors. The results are shown in Table 1. By comparing “single-hop + LCGN” with “single-hop” in the last two rows, it can be seen that our LCGN model brings over 4% (absolute) improvement in accuracy, indicating that our LCGN model facilitates reasoning by replacing the local features with the contextualized features containing rich relational information for the reasoning task. Figure 5 shows question answering examples from our model on this dataset.
We compare to three previous approaches in Table 1. CNN+LSTM  and Bottom-Up  are simple fusion approaches between the text and the image, using the released GQA convolutional grid features or object detection features respectively. The MAC model  is a multi-step attention and memory model with specially designed control, reading and writing cells, and is trained on the same object detection features as our model. Our approach outperforms the MAC model that performs multi-step inference, obtaining the state-of-the-art results on the GQA dataset.
|Method||Local features||val||test-dev|
|single-hop + LCGN||convolutional grid features||55.3%||49.5%|
|single-hop + LCGN||object features from detection||63.8%||55.6%|
|single-hop + LCGN||GT objects and attributes (see Footnote 2)||90.2%||n/a|
We further apply our LCGN model to other types of local features, and experiment with using either the same convolutional grid features as used in CNN+LSTM in Table 1 (where each $x_i^{loc}$ is the feature vector at the $i$-th spatial location) or an “oracle” symbolic local representation at both training and test time, based on a set of ground-truth objects along with their class and attribute annotations (“GT objects and attributes”) in the scene graph data of the GQA dataset. In the latter setting with symbolic representation, we construct two one-hot vectors to represent each object’s class and attributes, and concatenate them as each object’s $x_i^{loc}$. (Footnote 2: In this setting, we can only evaluate on the val split with public scene graph annotations. We note that this is the only setting where we use the scene graphs in the GQA dataset. In all other settings, we only use the images and question-answer pairs to train our models. Also, our model does not rely on the GQA question semantic step annotations in any settings.) The results are shown in Table 2, where our LCGN model delivers consistent improvements over all three types of local feature representations.
Evaluation on the CLEVR dataset
We also evaluate our LCGN model on the CLEVR dataset, a dataset for VQA with complicated relational questions, such as "what number of other objects are there of the same size as the brown shiny object". Following previous works, we use the convolutional grid features extracted from the C4 block of an ImageNet-pretrained ResNet-101 network as the local features on the CLEVR dataset (each $x_i^{loc}$ is a 1024-dimensional vector).
Similar to our experiments on the GQA dataset, we apply our LCGN model together with the single-hop answer classifier and compare it with using only the local features in the answer classifier. We also compare to previous works that also use only question-answer pairs as supervision (without relying on the functional program annotations in ).
The results are shown in Table 3. It can be seen that the single-hop classifier only achieves 72.6% accuracy when using the local convolutional grid features (“single-hop”), which is unsurprising since the CLEVR dataset often involves resolving multiple and higher-order relations beyond the capacity of the single-hop classifier alone. However, when trained together with the context-aware representation from our LCGN model, this same single-hop classifier (“single-hop + LCGN”) achieves a significantly higher accuracy of 97.9%, comparable to several state-of-the-art approaches on this dataset, showing that our LCGN model is able to embed relational context information in its output scene representation $\{x_i^{out}\}$. Among previous works, Stack-NMN and MAC rely on multi-step inference procedures to predict an answer. RN pools over all pairwise object-object vectors to collect relational information in a single step. FiLM modulates the batch normalization parameters of a convolutional network with the input question. NS-CL learns symbolic representations of the scene and uses quasi-logical reasoning. Except for Stack-NMN, most previous works are tailored to the VQA task, and it is non-trivial to apply them to other tasks such as REF, while our LCGN model provides a generic scene representation applicable to multiple tasks. Figure 5 shows question answering examples of our model.
|single-hop + LCGN (ours)||97.9%|
We further experiment with varying the number $T$ of message passing iterations in our LCGN model. In addition, to isolate the effect of conditioning on textual inputs during message passing, we also train and evaluate a restricted version of LCGN without text conditioning (“single-hop + LCGN w/o txt”), by replacing the $c_t$’s from Eqn. 3 with a vector of all ones. The results are shown in Table 4, where it can be seen that using multiple rounds of iterations ($T > 1$) leads to a significant performance increase, and that it is crucial to incorporate the textual information into the message passing procedure. This is likely because the CLEVR dataset involves complicated questions that need multi-step context propagation. In addition, it is more efficient to collect the specific relational context relevant to the input question, instead of building a scene representation with a complete and unconditional knowledge base of all relational information that any input question can query from.
|single-hop + LCGN||94.0%|
|single-hop + LCGN||94.5%|
|single-hop + LCGN||96.4%|
|single-hop + LCGN||97.9%|
|single-hop + LCGN||96.9%|
|single-hop + LCGN w/o txt||78.6%|
|single-hop + LCGN w/ static||96.5%|
Given that multi-round message passing ($T > 1$) works better than using only a single round ($T = 1$), we further study whether it is beneficial to have dynamic connection weights $w_{j,i}^{(t)}$ in Eqn. 5 that can be different in each iteration $t$, to allow an object to focus on different context objects in different rounds. As a comparison, we train a restricted version of LCGN with static connection weights (“single-hop + LCGN w/ static $w_{j,i}$”), where we only predict the weights $w_{j,i}^{(1)}$ in Eqn. 5 for the first round $t = 1$, and reuse them in all subsequent rounds (setting $w_{j,i}^{(t)} = w_{j,i}^{(1)}$ for all $t > 1$). From the last row of Table 4 it can be seen that there is a performance drop when restricting to static connection weights predicted only in the first round, and we also observe a similar (but larger) drop for the REF task in Sec. 4.2 and Table 5. This suggests that it is better to have dynamic connections during each iteration, instead of first predicting a fixed connection structure on which iterative message passing is performed.
4.2 Referring Expression Comprehension (REF)
Our LCGN model provides a generic approach to building context-aware scene representations and is not restricted to a specific task such as VQA. We also apply our LCGN model to the referring expression comprehension (REF) task, where given a referring expression that describes an object in the scene, the model is asked to localize the target object with a bounding box.
We experiment with the CLEVR-Ref+ dataset, which contains images similar to those in the CLEVR dataset for VQA and complicated referring expressions requiring relation resolution. On the CLEVR-Ref+ dataset, we evaluate with the bounding box detection task, where the output is a bounding box of the target object and there is only one single target object described by the expression. A localization is considered correct if it overlaps with the ground-truth box with at least 50% IoU. Same as in our VQA experiments on the CLEVR dataset in Sec. 4.1, here we also use the convolutional grid features from the ResNet-101 C4 block as our local features ($x_i^{loc}$ is 1024-dimensional), with a fixed number $T$ of message passing rounds. The final target bounding box is predicted with a 4-dimensional bounding box offset vector in Eqn. 13 from the selected grid location in Eqn. 12.
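The 50% IoU correctness criterion can be illustrated with a short sketch. The `(x1, y1, x2, y2)` corner format and the function names are assumptions made for this example; the dataset's own evaluation code may use a different box convention.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.5):
    """A localization counts as correct when its IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold

def ref_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of expressions whose predicted box is correct."""
    hits = sum(is_correct(p, g, threshold) for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

# Two toy predictions: one shifted too far, one exact match.
preds = [(1, 0, 3, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (0, 0, 2, 2)]
print(ref_accuracy(preds, gts))  # 0.5
```

In the first pair the boxes share half their width, giving an IoU of 1/3, which falls below the threshold.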
|GroundeR + LCGN w/o txt||65.0%|
|GroundeR + LCGN w/ static||71.4%|
|GroundeR + LCGN (ours)||74.8%|
We apply our LCGN model to build a context-aware representation conditioned on the input referring expression, which is used as input to our implementation of the GroundeR approach (Sec. 3.2) for bounding box prediction (“GroundeR + LCGN”). As a comparison, we train and evaluate the GroundeR model without our context-aware representation (“GroundeR”), using local features $x_i^{loc}$ as inputs in Eqn. 11. Similar to our experiments on the CLEVR dataset for VQA in Sec. 4.1, we also ablate our LCGN model by not conditioning on the input expression in message passing (“GroundeR + LCGN w/o txt”) or by using static connection weights predicted from the first round (“GroundeR + LCGN w/ static $w_{j,i}$”).
The results are shown in Table 5, where our context-aware scene representation from LCGN leads to approximately 13% (absolute) improvement in REF accuracy. Consistent with our observation on the VQA task, for the REF task we find it important for the message passing procedure to depend on the input expression, and to allow the model to have dynamic connection weights that can differ for each round $t$. Our model outperforms previous work by a large margin, achieving the state-of-the-art performance for REF on the CLEVR-Ref+ dataset. Figure 5 shows example predictions of our model on the CLEVR-Ref+ dataset.
In previous works, SLR  and MAttNet  are specifically designed for the REF task. SLR jointly trains an expression generation model (speaker) and an expression comprehension model (listener), while MAttNet relies on modular structure for subject, location and relation comprehension. While Stack-NMN  is also a generic approach that is applicable to both the VQA task and the REF task, the major contribution of Stack-NMN is to construct an explicit step-wise inference procedure with compositional modules, and it relies on hand-designed module structures and local appearance-based scene representations. On the other hand, our work augments the scene representation with rich relational context. We show that our approach outperforms Stack-NMN on both the VQA and the REF tasks.
[Figure: question answering examples on the GQA dataset. Panels: input image, single-hop attention.]
question: is the fence in front of the elephant green and metallic? prediction: yes ground-truth: yes
question: the frisbee is on what animal? prediction: dog ground-truth: dog
[Figure: question answering examples on the CLEVR dataset. Panels: input image, single-hop attention.]
question: what color is the matte ball that is the same size as the gray metal thing? prediction: yellow ground-truth: yellow
question: how many other things are the same size as the yellow rubber ball? prediction: 3 ground-truth: 3
[Figure: example predictions on the CLEVR-Ref+ dataset. Panels: input image, bounding box output.]
referring expression: any other things that are the same shape as the big matte thing(s)
referring expression: the second one of the cube(s) from right
5 Conclusion
In this work, we propose Language-Conditioned Graph Networks (LCGN), a generic approach to language-based reasoning tasks such as VQA and REF. Instead of building task-specific inference procedures, our LCGN model constructs rich context-aware representations of the scene through iterative message passing. Experimentally, we show that the context-aware representations from our LCGN model greatly improve over the local appearance-based representations across various types of local features and multiple datasets, and that it is crucial for the message passing procedure to depend on the language inputs.
This work was partially supported by Berkeley AI Research, NSF and DARPA XAI.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and vqa. arXiv preprint arXiv:1707.07998, 2017.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In , 2016.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
-  P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
-  R. Cadene, H. Ben-younes, M. Cord, and N. Thome. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  S. Chang, J. Yang, S. Park, and N. Kwak. Broadcasting convolutional network for visual relational reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 754–769, 2018.
-  X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7239–7248, 2018.
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach.
Multimodal compact bilinear pooling for visual question answering and
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
-  J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR.org, 2017.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  R. Herzig, E. Levi, H. Xu, E. Brosh, A. Globerson, and T. Darrell. Classifying collisions with spatio-temporal action graph networks. arXiv preprint arXiv:1812.01233, 2018.
-  R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–69, 2018.
-  R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1115–1124, 2017.
-  D. A. Hudson and C. D. Manning. Compositional attention networks for machine reasoning. In Proceedings of the International Conference on Learning Representation (ICLR), 2018.
-  D. A. Hudson and C. D. Manning. Gqa: a new dataset for compositional question answering over real-world images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  R. Liu, C. Liu, Y. Bai, and A. Yuille. Clevr-ref+: Diagnosing visual reasoning with referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  Y. Liu, R. Wang, S. Shan, and X. Chen. Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6985–6994, 2018.
-  J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR, 2019.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
-  W. Norcliffe-Brown, S. Vafeias, and S. Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8344–8353, 2018.
-  E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
-  S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 401–417, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer, 2016.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4974–4983, 2017.
-  D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
-  P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
-  P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Y. Yao, J. Xu, F. Wang, and B. Xu. Cascaded mutual modulation for visual reasoning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
-  K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pages 1039–1050, 2018.
-  L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
-  L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
-  L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290, 2017.
-  J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
-  C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 1291–1300, 2017.
Appendix A Implementation details
In our implementation, we use as the dimensionality for the textual vectors (such as , , and ), and as the dimensionality for the context features of each entity . On the GQA dataset, we first reduce the dimensionality of the input local features (convolutional grid features, object detection features or GT objects and attributes in Table 2 of the main paper) to the same dimensionality with a single fully-connected layer (without non-linearity). During training, we use the Adam optimizer  with a batch size of and a learning rate of . On the CLEVR dataset and the CLEVR-Ref+ dataset, we first apply a small two-layer convolutional network on the ResNet-101-C4 features to output a feature map, so that the feature dimensionality at each location on the feature map is also reduced to . We train with the Adam optimizer  using a batch size of and a learning rate of .
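As an illustration of the dimensionality reduction described above, the following is a minimal sketch of a single fully-connected layer without non-linearity applied to each entity's local feature. It is not the authors' implementation: the helper names `make_linear` and `project`, the random initialization, and all sizes (`NUM_ENTITIES`, `IN_DIM`, `OUT_DIM`) are hypothetical and chosen only for readability.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create weights and bias for one fully-connected layer.

    The uniform initialization here is an assumption for illustration.
    """
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, weight, bias):
    """Apply y = W x + b to every entity feature, with no non-linearity."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weight, bias)]
        for feat in features
    ]


# Hypothetical sizes; the real feature dimensionalities are much larger.
NUM_ENTITIES, IN_DIM, OUT_DIM = 4, 8, 3

rng = random.Random(1)
local_features = [[rng.gauss(0.0, 1.0) for _ in range(IN_DIM)]
                  for _ in range(NUM_ENTITIES)]

weight, bias = make_linear(IN_DIM, OUT_DIM)
reduced = project(local_features, weight, bias)
print(len(reduced), len(reduced[0]))
```

Because the layer is purely linear, the same projection can be shared across grid features, detected objects, or ground-truth object encodings, provided each is first expressed as a fixed-length vector per entity.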
The shapes of the parameters in our Language-Conditioned Graph Networks (LCGN) and task-specific output modules are shown in Table 6. All our models are trained using a single Titan Xp GPU.
Table 6: Parameter groups in LCGN and the task-specific output modules: textual command extraction; language-conditioned message passing; the single-hop answer classifier for VQA; GroundeR for REF.
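To make the idea of a single attention hop over contextualized entity features more concrete, the sketch below scores each entity against a question vector, normalizes the scores with a softmax, and returns the attention-weighted sum. This is a simplified illustration under stated assumptions: plain dot-product scoring stands in for the learned parameters of the actual classifier, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_hop_attention(entity_feats, query):
    """One attention hop: weight entities by similarity to the query.

    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in entity_feats]
    weights = softmax(scores)
    dim = len(entity_feats[0])
    attended = [sum(w * feat[d] for w, feat in zip(weights, entity_feats))
                for d in range(dim)]
    return weights, attended


# Toy example: three entities with 2-D features and a 2-D question vector.
entity_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
weights, attended = single_hop_attention(entity_feats, query)
print(weights, attended)
```

In a full model, the attended vector would be combined with the question representation and passed to an answer classifier; that part is omitted here.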
Appendix B Additional visualization examples
Figures 6 and 7 show additional visualization examples for the VQA task on the GQA dataset and the CLEVR dataset, respectively. Figure 8 shows additional examples for the REF task on the CLEVR-Ref+ dataset.
Figure 6: Additional examples for the VQA task on the GQA dataset (columns: input image, single-hop attention).
question: are there carts near the pond? prediction: yes ground-truth: yes
question: what color is the flag? prediction: white ground-truth: white
question: what type of vehicle is in front of the hanging wires? prediction: train ground-truth: train
question: on what does the man sit? prediction: bench ground-truth: bench
question: are there both a tennis ball and a racket in the image? prediction: yes ground-truth: yes
question: what vehicle is on the highway? prediction: truck ground-truth: ambulance
question: who is holding the umbrella? prediction: woman ground-truth: lady
Figure 7: Additional examples for the VQA task on the CLEVR dataset (columns: input image, single-hop attention).
question: there is a small gray block ; are there any spheres to the left of it? prediction: yes ground-truth: yes
question: is the purple thing the same shape as the large gray rubber thing? prediction: no ground-truth: no
question: do the large metal sphere and the matte block have the same color? prediction: yes ground-truth: yes
question: is there anything else that has the same material as the red thing? prediction: yes ground-truth: yes
question: is there any other thing that is the same color as the cylinder? prediction: no ground-truth: no
question: what number of other objects are there of the same size as the gray sphere? prediction: 5 ground-truth: 5
question: is the number of small cylinders behind the cyan thing greater than the number of cubes that are behind the green block? prediction: yes ground-truth: no
question: how many other objects are the same shape as the purple metallic thing? prediction: 6 ground-truth: 7
Figure 8: Additional examples for the REF task on the CLEVR-Ref+ dataset (columns: input image, bounding box output).
referring expression: any other yellow shiny objects that have the same size as the first one of the objects from front
referring expression: any other tiny objects that have the same material as the third one of the objects from left
referring expression: the second one of the things from left
referring expression: any other matte things that have the same shape as the first one of the red metal things from right
referring expression: the first one of the things from front that are on the right side of the first one of the purple spheres from front
referring expression: the second one of the shiny objects from front
referring expression: any other matte things of the same shape as the fifth one of the rubber things from right
referring expression: look at sphere that is right of the first one of the things from front; the second one of the objects from right that are in front of it