Understanding the full semantics of rich visual scenes is a complex task that involves detecting individual entities, as well as reasoning about the joint combination of the entities and the relations between them. To represent entities and their relations jointly, it is natural to view them as a graph, where nodes are entities and edges represent relations. Such representations are often called Scene Graphs (SGs) . Because SGs allow explicit reasoning about images, substantial efforts have been made to infer them from raw images [15, 16, 40, 26, 46, 11, 48].
While scene graphs have been shown to be useful for various tasks [15, 16, 13], using them as a component in a visual reasoning system is challenging: (a) Because scene graphs are discrete and non-differentiable, it is difficult to learn them end-to-end from a downstream task. (b) The alternative is to pre-train SG predictors separately from supervised data, but this requires arduous and prohibitive manual annotation. Moreover, pre-trained SG predictors have low coverage, because the set of labels they are pre-trained on rarely fits the needs of a downstream task. For example, given an image of a parade and a question “point to the officer on the black horse”, that horse might not be a node in the graph, and the term “officer” might not be in the vocabulary. Given these limitations, it is an open question how to make scene graphs useful for visual reasoning applications.
In this work, we describe Differentiable Scene-Graphs (DSG), which address the above challenges (Figure 1). DSGs are an intermediate representation trained end-to-end from the supervision for a downstream reasoning task. The key idea is to relax the discrete properties of scene graphs such that each entity and relation is described with a dense differentiable descriptor.
We demonstrate the benefits of DSGs in the task of resolving referring relationships (RR)  (see Figure 1). Here, given an image and a triplet query ⟨subject, relation, object⟩, a model has to find the bounding boxes of the subject and object that participate in the relation.
We train an RR model with DSGs as an intermediate component. As such, DSGs are not trained with direct supervision about entities and relations, but using several supervision signals about the downstream RR task. We evaluate our approach on three standard RR datasets: Visual Genome , VRD  and CLEVR , and find that DSGs substantially improve performance compared to state-of-the-art approaches [27, 21].
To conclude, our novel contributions are: (1) A new Differentiable Scene-Graph representation for visual reasoning, which captures information about multiple entities in an image and their relations. We describe how DSGs can be trained end-to-end with a downstream visual reasoning task without direct supervision of pre-collected scene-graphs. (2) A new architecture for the task of referring relationships, using a DSG as its central component. (3) New state-of-the-art results on the task of referring relationships on the Visual Genome, VRD and CLEVR datasets.
2 Referring Relationship: The Learning Setup
In the referring relationship task  we are given an image and a subject-relation-object query . The goal is to output a bounding box for the subject, and another bounding box for the object. In practice sometimes there are several boxes for each. See Fig. 1 for a sample query and expected output.
Following , we focus on training a referring relationship predictor from labeled data. Namely, we use a training set consisting of images, queries and the correct boxes for these queries. We denote these by . As in , we assume that the vocabulary of query components (subject, object and relation) is fixed.
In our model, we break this task into two components that we optimize in parallel. We fine-tune the position of bounding boxes such that they cover entities tightly, and we also label these boxes with one of the following four possible labels. The labels “Subject” and “Object” disambiguate between the ’s’ and ’o’ entities in the query. The label “Other” refers to boxes corresponding to additional entities, not mentioned in the query, and the label “Background” refers to cases where the box does not describe an entity. We refer to these two optimization goals as Box Refiner and Referring Relationships Classifier.
3 Differentiable Scene Graphs
We start by discussing the motivation and potential advantages of using intermediate scene-graph-like representations, as compared to standard scene graphs. Then, we explain how DSGs fit into the full architecture of our model.
3.1 Why use intermediate DSG layers?
A scene graph (SG) represents entities and relations in an image as a set of nodes and edges. A “perfect” SG (representing all entities and relations) captures most of the information needed for visual reasoning, and thus should be useful as an intermediate representation. Such an SG can then be used by downstream reasoning algorithms, which take the predicted SG as input. Unfortunately, learning to predict “perfect” scene graphs for any downstream task is unlikely to succeed, due to the aforementioned challenges: first, there is rarely enough data to train good SG predictors; and second, learning to predict SGs independently of the downstream task tends to yield less relevant SGs.
Instead, we propose an intermediate representation, termed a “Differentiable Scene Graph” layer (DSG), which captures the relational information as in a scene graph but can be trained end-to-end in a task-specific manner (Fig. 2). Like SGs, a DSG keeps descriptors for visual entities and their relations. Unlike SGs, whose nodes and edges are annotated with discrete values (labels), a DSG contains a dense distributed representation vector for each detected entity (termed a node descriptor) and each pair of entities (termed an edge descriptor). These representations are themselves learned functions of the input image, as we explain in the supplemental material. Like SGs, a DSG only describes candidate boxes which cover entities of interest and their relations. Unlike SGs, each DSG descriptor encompasses not only the local information about a node, but also information about its context. Most importantly, because DSGs are differentiable, they can be used as input to downstream visual-reasoning modules, in our case, a referring relationships module.
DSGs provide several computational and modelling advantages:
Differentiability. Because node and edge descriptors are differentiable functions of detected boxes, and are fed into a differentiable reasoning module, the entire pipeline can be trained with gradient descent.
Dense descriptors. By keeping dense descriptors for nodes and edges, the DSG keeps more information about possible semantics of nodes and edges, instead of committing too early to hard sparse representations. This allows it to better fit downstream tasks.
Supervision using downstream tasks. Collecting supervised labels for training scene graphs is hard and costly. DSGs can be trained using training data that is available for downstream tasks, saving costly labeling efforts. That said, when labeled scene graphs are available for given images, they can be used when training the DSG, via an added loss component.
DSG descriptors are computed by integrating global information from the entire image using graph neural networks (see supplemental materials). Combining information across the image increases the accuracy of object and relation descriptors.
3.2 The DSG Model for Referring relationships
We now describe how DSGs can be combined with other modules to solve a visual reasoning task. The architecture of the model is illustrated in Fig. 2. First, the model extracts bounding boxes for entities and relations in the image. Next, it creates a differentiable scene-graph over these bounding boxes. Then, DSG features are used by two output modules, aimed at answering a referring-relationship query: a Box Refiner module that refines the bounding box of the relevant entities, and a Referring Relationships Classifier module that classifies each box as Subject, Object, Other or Background. We now describe these components in more detail.
Object Detector. We detect candidate entities using a standard region proposal network (RPN) , and denote their bounding boxes by ( may vary between images). We also extract a feature vector for each box and concatenate it with the box coordinates, yielding . See details in the supplemental material.
Relation Feature Extractor. Given any two bounding boxes and we consider the smallest box that contains the two boxes (their “union” box). We denote this “relation box” by and its features by . Finally, we denote the concatenation of the features and box coordinates by .
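As an illustration, the “union” box can be computed with simple coordinate min/max operations. A minimal sketch follows; the helper name and corner-format convention (x1, y1, x2, y2) are ours, not from the paper:

```python
def union_box(box_a, box_b):
    """Smallest box (x1, y1, x2, y2) containing both input boxes."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# Two disjoint boxes: the relation box spans both of them.
print(union_box((0, 0, 2, 2), (3, 1, 5, 4)))  # -> (0, 0, 5, 4)
```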
Differentiable Scene-Graph Generator. As discussed above, the goal of the DSG Generator is to transform the above features and into differentiable representations of the underlying scene graph. Namely, it maps these features into a new set of dense vectors and representing entities and relations. This mapping is intended to incorporate the relevant context of each feature vector; namely, the representation contains information about the entity, together with its image-wide context.
There are various possible approaches to achieve this mapping. Here we use the model proposed by , which uses a graph neural network for this transformation. See supplemental materials for details on this network.
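To make the idea concrete, here is a highly simplified sketch of one round of message passing over a complete graph of detections: each node descriptor is updated from its own features, the mean of the other nodes, and the mean of its incident edge features. This is our illustration only, not the exact architecture of the cited graph network:

```python
import numpy as np

def dsg_round(node_feats, edge_feats, W_self, W_nbr, W_edge):
    """One round of message passing (illustrative sketch).
    node_feats: (n, d); edge_feats: (n, n, d), one vector per ordered pair;
    weight matrices: (d, d). Returns contextualized node descriptors."""
    n, _ = node_feats.shape
    out = np.empty_like(node_feats)
    for i in range(n):
        # Context: mean of the other nodes and of node i's outgoing edges.
        nbr = (node_feats.sum(axis=0) - node_feats[i]) / max(n - 1, 1)
        edg = edge_feats[i].mean(axis=0)
        out[i] = np.tanh(node_feats[i] @ W_self + nbr @ W_nbr + edg @ W_edge)
    return out
```

Stacking several such rounds lets information propagate across the whole image, which is what makes each descriptor context-aware.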
Multi-task objective. In many domains, training with multi-task objectives can improve the accuracy of individual tasks, because auxiliary tasks operate as regularizers, pushing internal representations away from overfitting and towards capturing useful properties of the input. We follow this idea here and define a multi-task objective that has three components: (a) a Referring Relationships Classifier that matches boxes to subject and object query terms; (b) a Box Refiner that predicts accurate tight bounding boxes; and (c) a Box Labeler that recognizes visual entities in boxes when relevant ground truth is available. We also fine-tune the object-detector RPN that produces box proposals for our model.
Fig. 3 illustrates the effect of the first two components, and how they operate together to refine the bounding boxes and match them to the query terms. Specifically, Fig. 3c shows how box refinement produces boxes that are tight around objects and subjects, and Fig. 3d shows how RR classification matches boxes to query terms.
(A) Referring Relationships Classifier. Given a DSG representation, we use it for answering referring relationship queries. Recall that the output of an RR query ⟨subject, relation, object⟩ should be bounding boxes containing subjects and objects that participate in the query relation. Our model has already computed bounding boxes , as well as representations for each box. We next use a prediction model that takes as input features describing a bounding box and the query, and outputs one of four labels Subject, Object, Other, Background, where Other refers to a bounding box which is not the query Subject or Object, and Background refers to a false entity proposal. Denote the logits generated by this classifier for the box by . The output set (or ) is simply the set of bounding boxes classified as Subject (or Object). See supplemental materials for further implementation details.
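The classifier's interface can be sketched as a single linear layer over the concatenation of a box descriptor and a query embedding. The paper leaves the classifier's architecture to its supplemental material, so everything below (shapes, a one-layer form, the weight names) is an illustrative assumption:

```python
import numpy as np

LABELS = ["Subject", "Object", "Other", "Background"]

def rr_classify(box_descriptor, query_embedding, W, b):
    """Score one box against one query (illustrative one-layer sketch).
    box_descriptor: (d,); query_embedding: (q,); W: (d + q, 4); b: (4,).
    Returns the predicted label and the 4-way logits."""
    x = np.concatenate([box_descriptor, query_embedding])
    logits = x @ W + b
    return LABELS[int(np.argmax(logits))], logits
```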
(B) Box Refiner. The DSG is also used for further refinement of the bounding-boxes generated by the RPN network. The idea is that additional knowledge about image context can be used to improve the coordinates of a given entity. This is done via a network that takes as input the RPN box coordinates and a differentiable representation for box , and outputs new bounding box coordinates. See Fig. 3 for an illustration of box refinement, and the supplemental material for further implementation details.
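A minimal sketch of the refinement step: predict coordinate offsets from the box's differentiable descriptor and add them to the RPN coordinates. The actual refinement network is in the supplemental material; the direct additive-delta parameterization here is our simplification:

```python
import numpy as np

def refine_box(box, descriptor, W, b):
    """Refine one RPN box using its DSG descriptor (illustrative sketch).
    box: (4,) as (x1, y1, x2, y2); descriptor: (d,); W: (d, 4); b: (4,).
    Predicts per-coordinate offsets and applies them additively."""
    deltas = descriptor @ W + b
    return box + deltas
```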
(C) Optional auxiliary losses: Scene-Graph Labeling. In addition to the Box Refiner and Referring Relationships Classifier modules described above, one can also use supervision about labels of entities and relations if these are available at training time. Specifically, we train an object-recognition classifier operating on boxes, which predicts the label of every box for which a label is available. This classifier is trained as an auxiliary loss, in a multi-task fashion, and is described in detail below.
4 Training with Multiple Losses
We next explain how our model is trained for the RR task, and how we can also use the RR training data for supervising the DSG component. We train with a weighted sum of three losses: (1) a Referring Relationships Classifier loss, (2) a Box Refiner loss, and (3) an optional Scene-Graph Labeling loss. We now describe each of these components. Additional details are provided in the supplemental material.
4.1 Referring Relationship Classification Loss
The Referring Relationships Classifier (Sec. 3.2) outputs logits for each box, corresponding to its prediction (subject, object, etc.). To train these logits, we need to extract their ground-truth values from the training data. Recall that a given image in the training data may have multiple queries, and so may have multiple boxes that have been tagged as subject or object for the corresponding queries. To obtain the ground-truth for box and query we take the following steps. First, we find the ground-truth box that has maximal overlap with box . If this box is either a subject or object for the query , we set to be Subject or Object respectively. Otherwise, if the overlap with a ground-truth box for a different image-query is greater than , we set to Other (since it means there is some other entity in the box), and we set to Background if the overlap is less than . If the overlap is in between, we do not use the box for training. For instance, given a query ⟨woman, feeding, giraffe⟩ with ground-truth boxes for “woman” and “giraffe”, consider the box in the RPN that is closest to the ground-truth box for “woman”. Assume the index of this box is . Similarly, assume that the box closest to the ground-truth for “giraffe” has index . We would have , and the rest of the values would be either Other or Background. Given these ground-truth values, the Referring Relationship Classifier Loss is simply the sum of cross-entropies between the logits and the one-hot vectors corresponding to .
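The assignment rule above can be sketched as follows. The paper's overlap thresholds do not appear in this excerpt, so the values 0.5 and 0.3 below are illustrative placeholders, and the tie-breaking on zero overlaps is our simplification:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ground_truth_label(box, gt_subject, gt_object, other_gt_boxes,
                       t_hi=0.5, t_lo=0.3):
    """Return 'Subject'/'Object'/'Other'/'Background', or None if the box
    is ignored during training. Thresholds t_hi/t_lo are placeholders."""
    candidates = ([("Subject", gt_subject), ("Object", gt_object)] +
                  [("OtherGT", g) for g in other_gt_boxes])
    # Ground-truth box with maximal overlap (ties resolved by list order).
    best_label, best_box = max(candidates, key=lambda lb: iou(box, lb[1]))
    if best_label in ("Subject", "Object"):
        return best_label
    ov = iou(box, best_box)
    if ov > t_hi:
        return "Other"       # some other entity occupies the box
    if ov < t_lo:
        return "Background"  # the box covers no entity
    return None              # ambiguous overlap: excluded from the loss
```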
4.2 Box Refiner Loss
To train the Box Refiner, we use a smooth L1 loss between the coordinates of the refined (predicted) boxes and their ground-truth counterparts.
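For reference, the per-coordinate smooth L1 (Huber-style) loss is quadratic for small errors and linear for large ones; the `beta` transition point below is the common default, not a value stated in the paper:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss on a single coordinate: 0.5*d^2/beta for |d| < beta,
    |d| - 0.5*beta otherwise."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

print(smooth_l1(0.5, 0.0))  # -> 0.125
print(smooth_l1(3.0, 0.0))  # -> 2.5
```

The full loss sums this over the four coordinates of every refined box.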
4.3 Scene-Graph Labeling Loss
When ground-truth data about entity labels is available, we can use it as an additional source of supervision to train the DSG. Specifically, we train two classifiers. A classifier from features of entity boxes to the set of entity labels, and a classifier from features of relation boxes to relation labels. We then add a loss to maximize the accuracy of these classifiers with respect to the ground truth box labels.
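The auxiliary loss is then a sum of softmax cross-entropies over the boxes and relations that have ground-truth labels. A self-contained sketch (function names are ours; the classifiers producing the logits are those described above):

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one label prediction (log-sum-exp trick)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

def sg_labeling_loss(entity_logits, entity_labels,
                     relation_logits, relation_labels):
    """Sum cross-entropies over all labeled entity boxes and relation boxes."""
    loss = sum(cross_entropy(l, y)
               for l, y in zip(entity_logits, entity_labels))
    loss += sum(cross_entropy(l, y)
                for l, y in zip(relation_logits, relation_labels))
    return loss
```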
4.4 Tuning the Object Detector
In addition to training the DSG and its downstream visual-reasoning predictors, the object-detector RPN is also trained. The output of the RPN is a set of bounding boxes. The ground truth contains boxes that are known to contain entities, and the goal of this loss is to encourage the RPN to include these boxes as proposals. Concretely, we use a sum of two losses: first, an RPN classification loss, a cross-entropy over RPN anchors in which proposals with an overlap of 0.8 or higher with a ground-truth box are considered positive; second, an RPN box-regression loss, a smooth L1 loss between the ground-truth boxes and proposal boxes.
Table 2: Results, including standard error, for DSG variants on the validation set of the Visual Genome dataset. DSG values differ slightly from Table 1, which computed IOU on the test set. The various models are described in Sec. 6.2.
5 Experiments

In the following sections we provide details about the datasets, training, baseline models, evaluation metrics, model ablations and results. Due to space considerations, the implementation details of the model are provided in the supplemental material.
5.1 Datasets

We evaluate the model in the task of referring relationships across three datasets, each exhibiting a unique set of characteristics and challenges.
CLEVR . A synthetic dataset generated from scene graphs with four spatial relations: “left”, “right”, “front” and “behind”, and 48 entity categories. It has over 5M relationships, of which 33% involve ambiguous entities (multiple entities of the same type in an image).
VRD . The Visual Relationship Detection dataset contains 5,000 images with 100 entity categories and 70 relation categories. In total, VRD contains 37,993 relationship annotations with 6,672 unique relationship types and 24.25 relations per entity category. 60.3% of these relationships refer to ambiguous entities.
Visual Genome . VG is the largest public corpus for visual relationships in real images, with 108,077 images annotated with bounding boxes, entities and relations. On average, each image has 12 entities and 7 relations. In total, there are over 2.3M relationships, 61% of which refer to ambiguous entities.
5.2 Evaluation Metrics
We compare our model to previous work using the average IOU for subjects and for objects. To compute the average subject IOU, we first generate two binary attention maps: one that includes all the ground-truth boxes labeled as Subject (recall that several entities might be labeled as Subject), and one that includes all the box proposals predicted as Subject. If no box is predicted as Subject, the box with the highest score for the label Subject is included in the predicted attention map. We then compute the Intersection-Over-Union between the two binary attention maps. For a proper comparison with previous work , we use . The object boxes are evaluated in exactly the same manner.
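The metric can be sketched as rasterizing both box sets onto binary masks and taking the IoU of the masks; the helper names and the integer-pixel grid below are our assumptions:

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Union of boxes (x1, y1, x2, y2) as a binary attention map of size h x w."""
    m = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        m[y1:y2, x1:x2] = True
    return m

def attention_iou(gt_boxes, pred_boxes, h, w):
    """IoU between the ground-truth and predicted binary attention maps."""
    gt = boxes_to_mask(gt_boxes, h, w)
    pr = boxes_to_mask(pred_boxes, h, w)
    inter = np.logical_and(gt, pr).sum()
    union = np.logical_or(gt, pr).sum()
    return inter / union if union else 0.0
```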
5.3 Baselines

The Referring Relationship task was introduced recently , and the SSAS model was proposed as a possible approach (see below). We report the results for the baseline models in . When evaluating our Differentiable Scene-Graph model, we use exactly the same evaluation setting as in  (i.e., same data splits, entity and relation categories). The baselines reported are:
Symmetric Stacked Attention Shifting (SSAS):  An iterative model that localizes the relationship entities using an attention-shift component learned for each relation.
Spatial Shifts : Same as SSAS, but with no iterations and with the shift-attention mechanism replaced by a statistically learned shift per relation that ignores the semantic meaning of entities.
Co-Occurrence : Uses an embedding of the subject and object pair for attending over the image features.
Visual Relationship Detection (VRD) : Similar to the Co-Occurrence model, but with an additional relationship embedding.
6 Results

Table 1 provides the average IOU for Subject and Object over the three datasets described in Sec. 5.1. We compare our model to the four baselines described in Sec. 5.3. Our Differentiable Scene-Graph approach outperforms all baselines in terms of average IOU.
Our results for the CLEVR dataset are significantly better than those in . Because CLEVR objects have a small set of distinct colors (Fig. 5), object detection in CLEVR is much easier than in natural images, making it easier to achieve high IOU. The baseline model without the DSG layer (no-DSG) is an end-to-end model with a two-stage detector (in contrast to ), and already improves substantially over prior work, reaching 93.7%; our novel DSG approach further improves this to 96.3% (reducing error by 50%).
6.1 Analysis of success and failure cases.
Fig. 4 shows examples of success and failure cases. We further analyzed the types of common mistakes and their distribution. Since DSGs depend on box proposals, they are sensitive to the quality of the object detector. Manual inspection of images revealed five main error types: (1) 30%: Detector failure: the relevant box is missing from the box proposal list. (2) 23.3%: Subject or Object detected but classified as Other or as Background. (3) 16.6%: Relation misclassified: the entities classified as Subject and Object match the query, but without the required relation. (4) 16.6%: Multiplicity: either too few or too many of the GT boxes are classified as Subject or Object. (5) 13.3%: Other, including incorrect GT and hard-to-decide cases.
6.2 Model Ablations
We explored the power of DSGs through model ablations. First, since the model is trained with three loss components, we quantify the contribution of the Box Refinement loss and the Scene-Graph Labeling loss (it is not possible to omit the Referring Relationships Classifier loss). We further evaluate the contribution of the DSG compared with a two-step approach which first predicts an SG, and then reasons over it. We compare the following models:
Two steps: Two-step model. We first predict a scene-graph, and then match the query with the SG. The SG predictor consists of the same components used in the DSG: A box detector, DSG dense descriptors, and an SG labeler. It is trained with the same set of SG labels used for training the DSG. Details in the supplemental material.
DSG -SGL: DSG without the Scene-Graph Labeling component described in Sec. 4.3.
DSG -BR: DSG where the Box Refiner component of Section 4.2 is replaced with fine-tuning the coordinates of the box proposals using the visual features extracted by the Object Detector. This variant allows us to quantify the benefit of refining the box proposals based on the differentiable representation of the scene.
no-DSG: A baseline model that does not use the DSG representations. Instead, the model includes only an Object Detector and a referring relationship classifier. The referring relationship classifier uses the features extracted by the Object Detector instead of the DSG node descriptors. This model allows us to quantify the benefit of the differentiable scene representation for referring relationship classification.
Table 2 provides results of ablation experiments on the validation set of the Visual Genome dataset . All model variants based on the scene representation perform better than the model that does not use the DSG representation (i.e., no-DSG) in terms of average IOU over subject and object, demonstrating the power of the contextualized scene representation.
The DSG model outperforms all model ablations, illustrating the improvements achieved by using partial supervision for training the differentiable scene-graph. Fig. 7 illustrates the effect of ablating various components of the model.
6.3 Inferring SGs from DSGs
While the DSG layer is not designed for predicting scene graphs from images, it can be used for inferring scene graphs. We decoded SG nodes that are contained in the RR vocabulary by constructing two classifiers: a 1-layer classifier mapping node descriptors to logits over the entity vocabulary, and a 1-layer classifier mapping edge descriptors to logits over the relation vocabulary. Fig. 6 illustrates the result of this inference, showing a scene graph inferred from the DSG, trained using this loss. The predicted graph is indeed largely correct, despite the fact that it was not directly trained for this task.
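Decoding amounts to a single linear map followed by an argmax over the vocabulary, applied once with the entity vocabulary (node descriptors) and once with the relation vocabulary (edge descriptors). A sketch, with the weights and vocabulary assumed for illustration:

```python
import numpy as np

def decode_labels(descriptors, W, b, vocab):
    """Map descriptors (n, d) through a 1-layer classifier (W: (d, v),
    b: (v,)) and return the argmax vocabulary label for each descriptor."""
    logits = descriptors @ W + b
    return [vocab[i] for i in logits.argmax(axis=1)]
```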
We further analyzed the accuracy of the predicted SGs by comparing to ground-truth SGs on Visual Genome (complete SGs were not used for training, only for this analysis). SGs decoded from DSGs achieve accuracy of for object labels and for relations (calculated for proposals with IOU ).
7 Related Work
Graph Neural Networks. Recently, major progress has been made in constructing graph neural networks (GNN). These refer to a class of neural networks that operate directly on graph-structured data by passing local messages [7, 24]. Variants of GNNs have been shown to be highly effective at relational reasoning tasks , classification of graphs [2, 3, 31, 4], and classification of nodes in large graphs [18, 9]. The expressive power of GNNs has also been studied in [11, 45]. GNNs have also been applied to visual understanding in [11, 41, 39, 10] and control [37, 1]. Similar aggregation schemes have also been applied to object detection .
Visual Relationships. Earlier work aimed to leverage visual relationships for improving detection , action recognition and pose estimation , semantic image segmentation , or detection of human-object interactions [42, 32, 25]. Lu et al.  were the first to formulate detection of visual relationships as a separate task. They learn a likelihood function that uses a language prior based on word embeddings for scoring visual relationships and constructing scene graphs, while other recent works proposed better methods for relationship detection [19, 47].
Scene Graphs. Scene graphs provide a compact representation of the semantics of an image, and have been shown to be useful for semantic-level interpretation and reasoning about a visual scene . Extracting scene graphs from images provides a semantic representation that can later be used for reasoning, question answering , and image retrieval [15, 34].
Previous scene-graph prediction work used attention  or neural message passing .  suggested predicting graphs directly from pixels in an end-to-end manner. NeuralMotif  considers global context using an RNN that sequentially reads the independent predictions for each entity and relation and then refines those predictions.
Referring Relationships. Several recent studies looked into the task of detecting an entity based on a referring expression [17, 20], while taking context into account.  described a model that has two parts: one for generating expressions that point to an entity in a discriminative fashion and a second for understanding these expressions and detecting the referred entity.  explored the role of context and visual comparison with other entities in referring expressions.
Modelling context was also the focus of , using a multi-instance-learning objective. Recently,  introduced an explicit iterative model that localizes the two entities in the referring relationship task, conditioned on one another, using attention from one entity to the other. In contrast to that work, we show an implicit model that uses latent scene context, resulting in new state-of-the-art results on three vision datasets that contain visual relationships.
8 Conclusion

This work is motivated by the assumption that accurate reasoning about images may require access to a detailed representation of the image. While scene graphs provide a natural structure for representing relational information, it is hard to train very dense SGs in a fully supervised manner, and for any given image, the resulting SGs may not be appropriate for downstream reasoning tasks. Here we advocate DSGs, an alternative representation that captures the information in SGs, but is continuous and can be trained jointly with downstream tasks. Our results, both qualitative (Fig. 4) and quantitative (Tables 1 and 2), suggest that DSGs effectively capture scene structure, and that this can be used for downstream tasks such as referring relationships.
One natural next step is to study such representations in additional downstream tasks that require integrating information across the image. Some examples are caption generation and visual question answering. DSGs can be particularly useful for VQA, since many questions are easily answerable by scene graphs (e.g., counting questions and questions about relations). Another important extension to DSGs would be a model that captures high-order interactions, as in a hyper-graph. Finally, it will be interesting to explore other approaches to training the DSG, and in particular finding ways for using unlabeled data for this task.
-  P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, F. Song, A. J. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
-  J. Bruna and S. Mallat. Invariant scattering convolution networks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1872–1886, 2013.
-  H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Int. Conf. Mach. Learning, pages 2702–2711, 2016.
-  M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Neural Inform. Process. Syst., pages 3837–3845, 2016.
-  C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV, pages 158–172, 2012.
-  C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In , pages 1–8, June 2008.
-  J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
-  A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, pages 16–29, 2008.
-  W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Neural Inform. Process. Syst., pages 1024–1034, 2017.
-  R. Herzig, E. Levi, H. Xu, E. Brosh, A. Globerson, and T. Darrell. Classifying collisions with spatio-temporal action graph networks. arXiv preprint arXiv:1812.01233, 2018.
-  R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2018.
-  H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
-  J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. arXiv preprint arXiv:1804.01622, 2018.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CoRR, 2016.
-  J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Image retrieval using scene graphs. In Proc. Conf. Comput. Vision Pattern Recognition, pages 3668–3678, 2015.
-  J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3678, 2015.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
-  A. Kolesnikov, C. H. Lampert, and V. Ferrari. Detecting visual relationships using box attention. arXiv preprint arXiv:1807.02136, 2018.
-  E. Krahmer and K. Van Deemter. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218, 2012.
-  R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. ArXiv e-prints, 2016.
-  D. LaBerge, R. Carlson, J. Williams, and B. Bunney. Shifting attention in visual space: tests of moving-spotlight models versus an activity-distribution model. Journal of experimental psychology. Human perception and performance, 23(5):1380–1392, October 1997.
-  Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015.
-  Y.-L. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H.-S. Fang, Y. Wang, and C. Lu. Transferable interactiveness knowledge for human-object interaction detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  W. Liao, M. Y. Yang, H. Ackermann, and B. Rosenhahn. On support relations and semantic scene graphs. arXiv preprint arXiv:1609.05834, 2016.
-  C. Lu, R. Krishna, M. S. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conf. Comput. Vision, pages 852–869, 2016.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
-  V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision, pages 792–807. Springer, 2016.
-  A. Newell and J. Deng. Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems 30, pages 1172–1180. Curran Associates, Inc., 2017.
-  M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In Int. Conf. Mach. Learning, pages 2014–2023, 2016.
-  B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, pages 1946–1955, 2017.
-  M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo. Attentive relational networks for mapping images to scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  D. Raposo, A. Santoro, D. Barrett, R. Pascanu, T. Lillicrap, and P. Battaglia. Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068, 2017.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
-  M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In Computer Vision and Pattern Recognition (CVPR), 2011.
-  A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. A. Riedmiller, R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines for inference and control. In ICML, pages 4467–4476, 2018.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Neural Inform. Process. Syst., pages 4967–4976, 2017.
-  X. Wang and A. Gupta. Videos as space-time region graphs. In ECCV, 2018.
-  D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene Graph Generation by Iterative Message Passing. In Proc. Conf. Comput. Vision Pattern Recognition, pages 3097–3106, 2017.
-  J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph R-CNN for scene graph generation. In European Conf. Comput. Vision, pages 690–706, 2018.
-  M. Y. Yang, W. Liao, H. Ackermann, and B. Rosenhahn. On support relations and semantic scene graphs. ISPRS journal of photogrammetry and remote sensing, 131:15–25, 2017.
-  K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pages 1039–1050, 2018.
-  L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
-  M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems 30, pages 3394–3404. Curran Associates, Inc., 2017.
-  R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. arXiv preprint arXiv:1711.06640, 2017.
-  J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph parsing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  J. Zhang, K. J. Shih, A. Tao, B. Catanzaro, and A. Elgammal. An interpretable model for scene graph generation. CoRR, abs/1811.09543, 2018.