Learning Latent Scene-Graph Representations for Referring Relationships

02/26/2019 ∙ by Moshiko Raboh, et al. ∙ 6

Understanding the semantics of complex visual scenes often requires analyzing a network of objects and their relations. Such networks are known as scene-graphs. While scene-graphs have great potential for machine vision applications, learning scene-graph based models is challenging. One reason is the complexity of the graph representation, and the other is the lack of large scale data for training broad coverage graphs. In this work we propose a way of addressing these difficulties, via the concept of a Latent Scene Graph. We describe a family of models that uses "scene-graph like" representations, and uses them in downstream tasks. Furthermore, we show how these representations can be trained from partial supervision. Finally, we show how our approach can be used to achieve new state of the art results on the challenging problem of referring relationships.




1 Introduction

Understanding the full semantics of rich visual scenes is a complex task that involves detecting individual entities, as well as reasoning about the joint combination of the entities and the relations between them. To represent entities and their relations jointly, it is natural to view them as a graph, where nodes are entities and edges represent relations. Such representations are often called Scene Graphs (SGs) [16]. Because SGs allow explicit reasoning about images, substantial efforts have been made to infer them from raw images [15, 16, 40, 26, 46, 11, 48].

Figure 1: Differentiable Scene Graphs: an intermediate “graph-like” representation that provides a distributed representation for each entity and pair of entities in an image. Differentiable scene graphs can be learned with gradient descent in an end-to-end manner, using only supervision from a downstream visual reasoning task (referring relationships here).

While scene graphs have been shown to be useful for various tasks [15, 16, 13], using them as a component in a visual reasoning system is challenging: (a) Because scene graphs are discrete and non-differentiable, it is difficult to learn them end-to-end from a downstream task. (b) The alternative is to pre-train SG predictors separately from supervised data, but this requires arduous and prohibitive manual annotation. Moreover, pre-trained SG predictors have low coverage, because the set of labels they are pre-trained on rarely fits the needs of a downstream task. For example, given an image of a parade and a question “point to the officer on the black horse”, that horse might not be a node in the graph, and the term “officer” might not be in the vocabulary. Given these limitations, it is an open question how to make scene graphs useful for visual reasoning applications.

In this work, we describe Differentiable Scene-Graphs (DSG), which address the above challenges (Figure 1). DSGs are an intermediate representation trained end-to-end from the supervision for a downstream reasoning task. The key idea is to relax the discrete properties of scene graphs such that each entity and relation is described with a dense differentiable descriptor.

We demonstrate the benefits of DSGs in the task of resolving referring relationships (RR) [21] (see Figure 1). Here, given an image and a triplet query ⟨subject, relation, object⟩, a model has to find the bounding boxes of the subject and object that participate in the relation.

We train an RR model with DSGs as an intermediate component. As such, DSGs are not trained with direct supervision about entities and relations, but using several supervision signals about the downstream RR task. We evaluate our approach on three standard RR datasets: Visual Genome [22], VRD [27] and CLEVR [14], and find that DSGs substantially improve performance compared to state-of-the-art approaches [27, 21].

To conclude, our novel contributions are: (1) A new Differentiable Scene-Graph representation for visual reasoning, which captures information about multiple entities in an image and their relations. We describe how DSGs can be trained end-to-end with a downstream visual reasoning task without direct supervision of pre-collected scene-graphs. (2) A new architecture for the task of referring relationships, using a DSG as its central component. (3) New state-of-the-art results on the task of referring relationships on the Visual Genome, VRD and CLEVR datasets.

Figure 2: The proposed architecture. The input consists of an image and a relationship query triplet ⟨subject, relation, object⟩. (1) A detector produces a set of bounding box proposals. (2) An RoiAlign layer extracts object features from the backbone using the boxes. In parallel, every pair of box proposals is used for computing a union box, and pairwise features are extracted in the same way as object features. (3) These features are used as inputs to a Differentiable Scene-Graph Generator Module, which outputs the Differentiable Scene Graph, a new and improved set of node and edge features. (4) The DSG is used both for refining the original box proposals and as input to a Referring Relationships Classifier, which classifies each bounding box proposal as either Subject, Object, Other or Background. The ground-truth label of a proposal box is Other if the proposal is involved in another query relationship over this image. Otherwise the ground-truth label is Background.

2 Referring Relationship: The Learning Setup

In the referring relationship task [21] we are given an image and a subject-relation-object query . The goal is to output a bounding box for the subject, and another bounding box for the object. In practice sometimes there are several boxes for each. See Fig. 1 for a sample query and expected output.

Following [21], we focus on training a referring relationship predictor from labeled data. Namely, we use a training set consisting of images, queries and the correct boxes for these queries. We denote these by . As in [21], we assume that the vocabulary of query components (subject, object and relation) is fixed.

In our model, we break this task into two components that we optimize in parallel. We fine-tune the position of bounding boxes such that they cover entities tightly, and we also assign each box one of four possible labels. The labels “Subject” and “Object” disambiguate between the ‘s’ and ‘o’ entities in the query. The label “Other” refers to boxes corresponding to additional entities not mentioned in the query, and the label “Background” refers to cases where the box does not describe an entity. We refer to these two optimization goals as the Box Refiner and the Referring Relationships Classifier.

3 Differentiable Scene Graphs

We start by discussing the motivation and potential advantages of using intermediate scene-graph-like representations, as compared to standard scene graphs. Then, we explain how DSGs fit into the full architecture of our model.

3.1 Why use intermediate DSG layers?

A scene graph (SG) represents entities and relations in an image as a set of nodes and edges. A “perfect” SG (representing all entities and relations) captures most of the information needed for visual reasoning, and thus should be useful as an intermediate representation. Such a SG can then be used by downstream reasoning algorithms, using the predicted SG as an input. Unfortunately, learning to predict “perfect” scene graphs for any downstream task is unlikely due to the aforementioned challenges: First, there is rarely enough data to train good SG predictors, and second, learning to predict SGs in a way that is independent of the downstream task, tends to yield less relevant SGs.

Instead, we propose an intermediate representation, termed a “Differentiable Scene Graph” layer (DSG), which captures the relational information as in a scene graph but can be trained end-to-end in a task-specific manner (Fig. 2). Like SGs, a DSG keeps descriptors for visual entities and their relations. Unlike SGs, whose nodes and edges are annotated by discrete values (labels), a DSG contains a dense distributed representation vector for each detected entity (termed node descriptor) and each pair of entities (termed edge descriptor). These representations are themselves learned functions of the input image, as we explain in the supplemental material. Like SGs, a DSG only describes candidate boxes which cover entities of interest and their relations. Unlike SGs, each DSG descriptor encompasses not only the local information about a node, but also information about its context. Most importantly, because DSGs are differentiable, they can be used as input to downstream visual-reasoning modules, in our case, a referring relationships module.

DSGs provide several computational and modelling advantages:

Differentiability. Because node and edge descriptors are differentiable functions of detected boxes, and are fed into a differentiable reasoning module, the entire pipeline can be trained with gradient descent.

Dense descriptors. By keeping dense descriptors for nodes and edges, the DSG keeps more information about possible semantics of nodes and edges, instead of committing too early to hard sparse representations. This allows it to better fit downstream tasks.

Supervision using downstream tasks. Collecting supervised labels for training scene graphs is hard and costly. DSGs can be trained using training data that is available for downstream tasks, saving costly labeling efforts. That said, when labeled scene graphs are available for given images, that data can be used when training the DSG, using an added loss component.

Holistic representation. DSG descriptors are computed by integrating global information from the entire image using graph neural networks (see supplemental materials). Combining information across the image increases the accuracy of object and relation descriptors.

3.2 The DSG Model for Referring relationships

We now describe how DSGs can be combined with other modules to solve a visual reasoning task. The architecture of the model is illustrated in Fig. 2. First, the model extracts bounding boxes for entities and relations in the image. Next, it creates a differentiable scene-graph over these bounding boxes. Then, DSG features are used by two output modules, aimed at answering a referring-relationship query: a Box Refiner module that refines the bounding boxes of the relevant entities, and a Referring Relationships Classifier module that classifies each box as Subject, Object, Other or Background. We now describe these components in more detail.

Object Detector. We detect candidate entities using a standard region proposal network (RPN) [35], and denote their bounding boxes by ( may vary between images). We also extract a feature vector for each box and concatenate it with the box coordinates, yielding . See details in the supplemental material

Relation Feature Extractor. Given any two bounding boxes and we consider the smallest box that contains the two boxes (their “union” box). We denote this “relation box” by and its features by . Finally, we denote the concatenation of the features and box coordinates by .
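The union-box construction can be illustrated with a minimal sketch. It assumes boxes are stored as (x1, y1, x2, y2) corner coordinates; that convention and the function names `union_box` and `pairwise_union_boxes` are illustrative choices, not taken from the paper.

```python
import numpy as np

def union_box(box_a, box_b):
    """Smallest axis-aligned box containing both inputs.

    Boxes are assumed to be (x1, y1, x2, y2); this layout is an assumption.
    """
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    # Top-left corner is the element-wise minimum, bottom-right the maximum.
    return np.concatenate([np.minimum(a[:2], b[:2]), np.maximum(a[2:], b[2:])])

def pairwise_union_boxes(boxes):
    """Union box for every ordered pair (i, j); output shape is (B, B, 4)."""
    boxes = np.asarray(boxes, dtype=float)
    top_left = np.minimum(boxes[:, None, :2], boxes[None, :, :2])
    bottom_right = np.maximum(boxes[:, None, 2:], boxes[None, :, 2:])
    return np.concatenate([top_left, bottom_right], axis=-1)

boxes = [[10, 20, 50, 80], [30, 10, 90, 60]]
relation_box = union_box(boxes[0], boxes[1])
all_relation_boxes = pairwise_union_boxes(boxes)
```

The vectorized variant relies on broadcasting, so the cost grows quadratically with the number of proposals, which matches the "every pair of box proposals" description in Figure 2.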

Figure 3: The effect of box refinement and RR classification. (a) The DSG network is applied to an input image. (b) The object detector component generates box proposals for entities in the image. (c) The RR classifier component uses information from the DSG to label candidate boxes as object or subject entities. Then, the box refinement component also uses DSG information, this time to improve box locations for those boxes labeled as entities by the RR classifier. Here, boxes are tuned to focus on the most relevant entities in the image: the two “men”, the “surfboard”, the “sky” and the “ocean”. (d) Once the RR classifier has labeled entity boxes, it can correctly refer to the entities in the query ⟨cloud, in, sky⟩ (sky in green, clouds in violet). (e) Examples of candidate boxes classified by the RR classifier as background (non-entity), allowing the model to skip them when answering queries.

Differentiable Scene-Graph Generator. As discussed above, the goal of the DSG Generator is to transform the above features and into differentiable representations of the underlying scene graph. Namely, map these features into a new set of dense vectors and representing entities and relations. This mapping is intended to incorporate the relevant context of each feature vector. Namely, the representation contains information about the entity, together with its image-wide context.

There are various possible approaches to achieve this mapping. Here we use the model proposed by [11], which uses a graph neural network for this transformation. See supplemental materials for details on this network.

Multi-task objective. In many domains, training with multi-task objectives can improve the accuracy of individual tasks, because auxiliary tasks operate as regularizers, pushing internal representations away from overfitting and towards capturing useful properties of the input. We follow this idea here and define a multi-task objective that has three components: (a) a Referring Relationships Classifier matches boxes to subject and object query terms. (b) A Box Refiner predicts accurate tight bounding boxes. (c) A Box Labeler recognizes visual entities in boxes if relevant ground truth is available. We also tune an object detector RPN network producing box proposals for our model.

Fig. 3 illustrates the effect of the first two components, and how they operate together to refine the bounding boxes and match them to the query terms. Specifically, Fig. 3c shows how box refinement produces boxes that are tight around objects and subjects, and Fig. 3d shows how RR classification matches boxes to query terms.

(A) Referring Relationships Classifier. Given a DSG representation, we use it for answering referring relationship queries. Recall that the output of an RR query ⟨subject, relation, object⟩ should be bounding boxes containing subjects and objects that participate in the query relation. Our model has already computed bounding boxes , as well as representations for each box. We next use a prediction model that takes as input features describing a bounding box and the query, and outputs one of four labels {Subject, Object, Other, Background}, where Other refers to a bounding box which is not the query Subject or Object and Background refers to a false entity proposal. Denote the logits generated by this classifier for the box by . The output set (or ) is simply the set of bounding boxes classified as Subject (or Object). See supplemental materials for further implementation details.

(B) Box Refiner. The DSG is also used for further refinement of the bounding-boxes generated by the RPN network. The idea is that additional knowledge about image context can be used to improve the coordinates of a given entity. This is done via a network that takes as input the RPN box coordinates and a differentiable representation for box , and outputs new bounding box coordinates. See Fig. 3 for an illustration of box refinement, and the supplemental material for further implementation details.

(C) Optional auxiliary losses: Scene-Graph Labeling. In addition to the Box Refiner and Referring Relationships Classifier modules described above, one can also use supervision about labels of entities and relations if these are available at training time. Specifically, we train an object-recognition classifier operating on boxes, which predicts the label of every box for which a label is available. This classifier is trained as an auxiliary loss, in a multi-task fashion, and is described in detail below.

4 Training with Multiple Losses

We next explain how our model is trained for the RR task, and how we can also use the RR training data for supervising the DSG component. We train with a weighted sum of three losses: (1) Referring Relationships Classifier (2) Box Refiner (3) Optional Labeling loss. We now describe each of these components. Additional details are provided in the supplemental material.
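As a minimal sketch of the weighted sum described above: the function name `total_loss`, the default weights, and the convention of passing `None` when no label supervision exists are all assumptions made for illustration. The paper does not state the weight values here.

```python
def total_loss(rr_loss, box_loss, label_loss=None, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three training losses.

    rr_loss    : Referring Relationships Classifier loss
    box_loss   : Box Refiner loss
    label_loss : optional Scene-Graph Labeling loss (None if labels are missing)
    weights    : hypothetical loss weights; placeholder values only
    """
    w_rr, w_box, w_label = weights
    loss = w_rr * rr_loss + w_box * box_loss
    if label_loss is not None:
        # The labeling term is optional and only added when supervision exists.
        loss = loss + w_label * label_loss
    return loss

with_labels = total_loss(2.0, 1.0, 0.5, weights=(1.0, 0.5, 2.0))
without_labels = total_loss(2.0, 1.0)
```

Keeping the labeling term optional mirrors the text: the model can still be trained from RR supervision alone when entity and relation labels are unavailable.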

4.1 Referring Relationship Classification Loss

The Referring Relationships Classifier (Sec. 3.2) outputs logits for each box, corresponding to its prediction (subject, object, etc.). To train these logits, we need to extract their ground-truth values from the training data. Recall that a given image in the training data may have multiple queries, and so may have multiple boxes that have been tagged as subject or object for the corresponding queries. To obtain the ground-truth label for a given proposal box and query, we take the following steps. First, we find the ground-truth box that has maximal overlap with the proposal box. If this ground-truth box is either a subject or an object for the query, we set the label to Subject or Object respectively. Otherwise, if the overlap with a ground-truth box for a different image-query is greater than an upper threshold, we set the label to Other (since it means there is some other entity in the box), and we set it to Background if the overlap is less than a lower threshold. If the overlap falls between the two thresholds, we do not use the box for training. For instance, given a query ⟨woman, feeding, giraffe⟩ with ground-truth boxes for “woman” and “giraffe”, consider the RPN box that is closest to the ground-truth box for “woman”, and similarly the RPN box that is closest to the ground-truth box for “giraffe”. These two boxes would be labeled Subject and Object respectively, and the remaining boxes would be labeled either Other or Background. Given these ground-truth values, the Referring Relationship Classifier Loss is simply the sum of cross entropies between the logits and the one-hot vectors corresponding to the ground-truth labels.
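The label-assignment steps above can be sketched as follows. Several things here are assumptions rather than the authors' implementation: the (x1, y1, x2, y2) box layout, the names `iou` and `assign_label`, the use of `None` to mark boxes excluded from training, and the choice to apply the upper threshold to the Subject and Object cases as well. The threshold values in the usage lines are placeholders chosen only for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def assign_label(proposal, gt_boxes, gt_roles, upper, lower):
    """Ground-truth label for one proposal box and one query.

    gt_roles[k] is "Subject" or "Object" for boxes of the current query and
    "Other" for boxes annotated for a different query on the same image.
    Returns None when the box should be skipped during training.
    """
    if not gt_boxes:
        return "Background"
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
    if overlaps[best] > upper:
        return gt_roles[best]      # Subject, Object, or Other
    if overlaps[best] < lower:
        return "Background"        # no entity is covered by this proposal
    return None                    # ambiguous overlap: not used for training

gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gt_roles = ["Subject", "Object", "Other"]
proposals = [(0, 0, 10, 9), (50, 50, 60, 60), (100, 100, 110, 110), (0, 0, 10, 4)]
# upper=0.5 and lower=0.3 are illustrative placeholders, not values from the text.
labels = [assign_label(p, gt_boxes, gt_roles, upper=0.5, lower=0.3) for p in proposals]
```

The resulting labels can be converted to one-hot targets for the cross-entropy loss, with `None` entries masked out.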

4.2 Box Refiner Loss

To train the Box Refiner, we use a smooth L1 loss between the coordinates of the refined (predicted) boxes and their ground-truth counterparts.
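A minimal sketch of such a coordinate regression loss, assuming the standard smooth L1 form (the same family named for the RPN regression loss in Sec. 4.4). The transition point `beta`, the sum over the four coordinates, and the mean over boxes are assumptions; the paper's exact normalization is not given here.

```python
import numpy as np

def smooth_l1(pred_boxes, gt_boxes, beta=1.0):
    """Smooth L1 loss between predicted and ground-truth box coordinates.

    Quadratic for absolute errors below `beta`, linear above it.
    Inputs have shape (N, 4); the result is averaged over the N boxes.
    """
    diff = np.abs(np.asarray(pred_boxes, dtype=float) - np.asarray(gt_boxes, dtype=float))
    per_coord = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_coord.sum(axis=-1).mean()

pred = [[0.0, 0.0, 10.0, 10.0]]
target = [[0.5, 0.0, 10.0, 12.0]]
loss = smooth_l1(pred, target)  # 0.5 * 0.5**2 + (2.0 - 0.5) = 1.625
```

The linear regime limits the influence of badly misplaced proposals, while the quadratic regime keeps gradients small near the target.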

4.3 Scene-Graph Labeling Loss

When ground-truth data about entity labels is available, we can use it as an additional source of supervision to train the DSG. Specifically, we train two classifiers: a classifier from features of entity boxes to the set of entity labels, and a classifier from features of relation boxes to relation labels. We then add a loss to maximize the accuracy of these classifiers with respect to the ground-truth box labels.

4.4 Tuning the Object Detector

In addition to training the DSG and its downstream visual-reasoning predictors, the object detector RPN is also trained. The output of the RPN is a set of bounding boxes. The ground truth contains boxes that are known to contain entities. The goal of this loss is to encourage the RPN to include these boxes as proposals. Concretely, we use a sum of two losses: first, an RPN classification loss, which is a cross entropy over RPN anchors, where proposals with an overlap of 0.8 or higher with the ground-truth boxes are considered positive; second, an RPN box regression loss, which is a smooth L1 loss between the ground-truth boxes and proposal boxes.

Figure 4: Qualitative examples demonstrating successful predictions of the DSG model (six left panels) and errors (six right panels). The right panels illustrate common failure cases for each error type. (a) Missed detection: the detector missed the glasses on the table. (b, c) Misclassified object: the cake is detected but classified as Background. (d) Misclassified relation: the box classified as Subject is indeed a man, but it is not the man that has the required relation with the skate. (e, f) Multiplicity: either too few or too many GT boxes are classified as Subject or Object.
Average IOU
                 Visual Genome       VRD                 CLEVR
Model            subject   object    subject   object    subject   object
SS [23]          0.399     0.469     0.320     0.371     0.740     0.740
CO [6]           0.414     0.490     0.347     0.389     0.691     0.691
VRD [27]         0.417     0.480     0.345     0.387     0.734     0.732
SSAS [21]        0.421     0.482     0.369     0.410     0.778     0.778
no-DSG           0.412     0.47      0.333     0.366     0.937     0.937
DSG              0.489     0.539     0.4       0.435     0.963     0.963

Table 1: Comparison with baselines. Test-set mean IOU in the referring relationship task for the baselines in Sec. 5.3 and the Differentiable Scene Graph (DSG) model. Results are also reported for a no-DSG model (see Sec. 6.2), which classifies the referring relationship directly from the RPN output.
Average IOU
subject object
Two Step
Table 2: Model ablations. Results, including standard error, for DSG variants on the validation set of the Visual Genome dataset. DSG values differ slightly from Table 1, which reports IOU on the test set. The various models are described in Sec. 6.2.


5 Experiments

In the following sections we provide details about the datasets, training, baseline models, evaluation metrics, model ablations and results. Due to space considerations, the implementation details of the model are provided in the supplemental material.

5.1 Datasets

We evaluate the model in the task of referring relationships across three datasets, each exhibiting a unique set of characteristics and challenges.
CLEVR [14]. A synthetic dataset generated from scene-graphs with four spatial relations: “left”, “right”, “front” and “behind”, and 48 entity categories. It has over 5M relationships where 33% are ambiguous entities (multiple entities of the same type in an image).
VRD [27]. The Visual Relationship Detection dataset contains 5,000 images with 100 entity categories and 70 relation categories. In total, VRD contains 37,993 relationship annotations with 6,672 unique relationship types and 24.25 relations per entity category. 60.3% of these relationships refer to ambiguous entities.
Visual Genome [22]. VG is the largest public corpus for visual relationships in real images, with 108,077 images annotated with bounding boxes, entities and relations. On average, images have 12 entities and 7 relations per image. In total, there are over 2.3M relationships where 61% of those refer to ambiguous entities.

For a proper comparison with previous results [21], we used the data from [21] including the same entity and relation categories, query relationships and data splits.

5.2 Evaluation Metrics

We compare our model to previous work using the average IOU for subjects and for objects. To compute the average subject IOU, we first generate two binary attention maps: one that includes all the ground-truth boxes labeled as Subject (recall that several entities might be labeled as Subject) and another that includes all the box proposals predicted as Subject. If no box is predicted as Subject, the box with the highest score for the label Subject is included in the predicted attention map. We then compute the Intersection-Over-Union between the binary attention maps. For a proper comparison with previous work [21], we use . The object boxes are evaluated in exactly the same manner.
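The metric can be sketched as below, assuming the binary maps are rasterized at pixel resolution from integer (x1, y1, x2, y2) boxes. The map resolution, the box convention, and the helper names `boxes_to_mask`, `select_predictions` and `attention_map_iou` are illustrative assumptions and may differ from the evaluation code of [21].

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Binary attention map that is True inside any of the given boxes."""
    mask = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = True
    return mask

def select_predictions(boxes, labels, role_scores, role):
    """Boxes predicted as `role`; falls back to the top-scoring box if none."""
    chosen = [b for b, lab in zip(boxes, labels) if lab == role]
    if not chosen:
        chosen = [boxes[int(np.argmax(role_scores))]]
    return chosen

def attention_map_iou(gt_boxes, pred_boxes, height, width):
    """IOU between the ground-truth and predicted binary attention maps."""
    gt = boxes_to_mask(gt_boxes, height, width)
    pred = boxes_to_mask(pred_boxes, height, width)
    union = np.logical_or(gt, pred).sum()
    return float(np.logical_and(gt, pred).sum() / union) if union > 0 else 0.0

score = attention_map_iou([(0, 0, 4, 4)], [(2, 0, 6, 4)], height=8, width=8)
fallback = select_predictions(
    boxes=[(0, 0, 2, 2), (4, 4, 6, 6)],
    labels=["Other", "Background"],
    role_scores=[0.2, 0.6],
    role="Subject",
)
```

Because the maps are unions of boxes, several predicted Subject boxes are scored jointly against all ground-truth Subject boxes rather than matched one to one.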

5.3 Baselines

The Referring Relationship task was introduced recently [21], and the SSAS model was proposed as a possible approach (see below). We report the results for the baseline models in [21]. When evaluating our Differentiable Scene-Graph model, we use exactly the evaluation setting as in [21] (i.e., same data splits, entity and relation categories). The baselines reported are:

  1. Symmetric Stacked Attention Shifting (SSAS) [21]: An iterative model that localizes the relationship entities using an attention-shift component learned for each relation.

  2. Spatial Shifts (SS) [23]: Same as SSAS, but with no iterations and with the attention-shift mechanism replaced by a statistically learned shift per relation that ignores the semantic meaning of entities.

  3. Co-Occurrence (CO) [6]: Uses an embedding of the subject and object pair for attending over the image features.

  4. Visual Relationship Detection (VRD) [27]: Similar to the Co-Occurrence model, but with an additional relationship embedding.

6 Results

Table 1 provides average IOU for Subject and Object over the three datasets described in Sec. 5.1. We compare our model to four baselines described in Sec. 5.3. Our Differentiable Scene-Graph approach outperforms all baselines in terms of the average IOU.

Our results for the CLEVR dataset are significantly better than those in [21]. Because CLEVR objects have a small set of distinct colors (Fig. 5), object detection in CLEVR is much easier than in natural images, making it easier to achieve high IOU. The baseline model without the DSG layer (no-DSG) is an end-to-end model with a two-stage detector, in contrast to [21], and already improves strongly over prior work with 93.7%. Our novel DSG approach further improves this to 96.3% (reducing the error from 6.3% to 3.7%, a relative reduction of about 41%).

Figure 5: A typical image from the CLEVR [14] dataset. The image was trimmed to focus on areas with visual content.

6.1 Analysis of success and failure cases.

Fig. 4 shows examples of success cases and failure cases. We further analyzed the types of common mistakes and their distribution. Since DSGs depend on box proposals, they are sensitive to the quality of the object detector. Manual inspection of images revealed four main error types and a residual category: (1) 30%: Detector failure. The relevant box is missing from the box proposal list. (2) 23.3%: Misclassified entity. The Subject or Object is detected but classified as Other or as Background. (3) 16.6%: Relation misclassified. The entities classified as Subject and Object match the query, but without the required relation. (4) 16.6%: Multiplicity. Either too few or too many of the GT boxes are classified as Subject or Object. (5) 13.3%: Other, including incorrect GT and hard-to-decide cases.

6.2 Model Ablations

We explored the power of DSGs through model ablations. First, since the model is trained with three loss components, we quantify the contribution of the Box Refinement loss and the Scene-Graph Labeling loss (it is not possible to omit the Referring Relationships Classifier loss). We further evaluate the contribution of the DSG compared with a two-step approach which first predicts an SG, and then reasons over it. We compare the following models:

  1. DSG: The Differentiable Scene-Graph model described in Sec. 3.2 and trained as described in Sec. 4.

  2. Two steps: Two-step model. We first predict a scene-graph, and then match the query with the SG. The SG predictor consists of the same components used in the DSG: A box detector, DSG dense descriptors, and an SG labeler. It is trained with the same set of SG labels used for training the DSG. Details in the supplemental material.

  3. DSG -SGL: DSG without the Scene-Graph Labeling component described in Sec. 4.3.

  4. DSG -BR: DSG where the Box Refiner component of Section 4.2 is replaced with fine tuning the coordinates of the box proposal using the visual features extracted by the Object Detector. This variant allows us to quantify the benefit of refining the box proposals based on the differentiable representation of the scene.

  5. no-DSG: A baseline model that does not use the DSG representations. Instead, the model includes only an Object Detector and a referring relationship classifier. The referring relationship classifier uses the features extracted by the Object Detector instead of the DSG features. This model allows us to quantify the benefit of the differentiable scene representation for referring relationship classification.

Table 2 provides results of ablation experiments for the Visual Genome dataset [22] on the validation set. All model variants based on the scene representation perform better than the model that does not use the DSG representation (i.e., no-DSG) in terms of average IOU over subject and object, demonstrating the power of a contextualized scene representation.

The DSG model outperforms all model ablations, illustrating the improvements achieved by using partial supervision for training the differentiable scene-graph. Fig. 7 illustrates the effect of ablating various components of the model.

Figure 6: Inferring a Scene Graph from a DSG. Applying the RPN to this image results in 28 boxes. In (a) we show five of these, which received the largest weight in the attention model within the DSG generator (Sec. 3.2; details in the supplemental material). As mentioned in Sec. 4.3, “Scene-Graph Labeling Loss”, we can use the DSG for generating a labeled scene graph, corresponding to a fixed set of entities and relations. (b) shows this scene graph (i.e., the output of the classifiers predicting entity labels and relations), restricted to the highest-confidence relations. It can be seen that most relations are correct, despite not having trained this model on complete scene graphs.

6.3 Inferring SGs from DSGs

While the DSG layer is not designed for predicting scene graphs from images, it can be used for inferring scene graphs. We decoded SG nodes that are contained in the RR vocabulary by constructing two classifiers: a 1-layer classifier mapping the node descriptors to logits over the entity vocabulary, and a 1-layer classifier mapping the edge descriptors to logits over the relation vocabulary. Fig. 6 illustrates the result of this inference, showing a scene graph inferred from the DSG, trained using this loss. The predicted graph is indeed largely correct despite the fact that the model was not directly trained for this task.
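A minimal sketch of this decoding step, treating each 1-layer classifier as a single linear map followed by an argmax. All sizes are hypothetical, and random weights stand in for trained parameters; the node and edge descriptors would come from the DSG generator in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_logits(descriptors, weight, bias):
    """One linear layer: maps descriptors (..., d) to logits (..., num_classes)."""
    return descriptors @ weight + bias

# Hypothetical sizes: 4 boxes, 8-dim descriptors, 5 entity and 3 relation classes.
num_boxes, dim, num_entities, num_relations = 4, 8, 5, 3
node_descriptors = rng.normal(size=(num_boxes, dim))             # one per box
edge_descriptors = rng.normal(size=(num_boxes, num_boxes, dim))  # one per box pair

# Random stand-ins for the trained classifier parameters.
w_ent, b_ent = rng.normal(size=(dim, num_entities)), np.zeros(num_entities)
w_rel, b_rel = rng.normal(size=(dim, num_relations)), np.zeros(num_relations)

# Decoded scene graph: an entity label per node and a relation label per edge.
entity_labels = linear_logits(node_descriptors, w_ent, b_ent).argmax(axis=-1)
relation_labels = linear_logits(edge_descriptors, w_rel, b_rel).argmax(axis=-1)
```

Keeping only the edges with the highest-confidence logits, as done for Fig. 6, yields a sparse labeled graph from the dense descriptors.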

We further analyzed the accuracy of predicted SGs by comparing them to ground-truth SGs on Visual Genome (complete SGs were not used for training, only for this analysis). SGs decoded from DSGs achieve accuracy of for object labels and for relations (calculated for proposals with IOU ).

Figure 7: Comparing failures of ablation models with DSG predictions. The top row shows DSG results, while the bottom row shows results from different ablation models as specified in Sec. 6.2. In the first column, the scene graph predicted by the Two Step model did not include the shirt of one of the men, so this “subject” prediction was missed. In the second column, DSG -SGL failed to distinguish between the entity classes “woman” and “child”. In the third column, the DSG refines the box of “sky” to cover the entire sky area. In the last column, no-DSG did not classify the “object” box correctly.

7 Related Work

Graph Neural Networks. Recently, major progress has been made in constructing graph neural networks (GNN). These refer to a class of neural networks that operate directly on graph-structured data by passing local messages [7, 24]. Variants of GNNs have been shown to be highly effective at relational reasoning tasks [38], classification of graphs [2, 3, 31, 4], and classification of nodes in large graphs [18, 9]. The expressive power of GNNs has also been studied in [11, 45]. GNNs have also been applied to visual understanding in [11, 41, 39, 10] and control [37, 1]. Similar aggregation schemes have also been applied to object detection [12].

Visual Relationships. Earlier work aimed to leverage visual relationships for improving detection [36], action recognition and pose estimation [5], semantic image segmentation [8] or detection of human-object interactions [42, 32, 25]. Lu et al. [27] were the first to formulate detection of visual relationships as a separate task. They learn a likelihood function that uses a language prior based on word embeddings for scoring visual relationships and constructing scene graphs, while other recent works proposed better methods for relationship detection [19, 47].

Scene Graphs. Scene graphs provide a compact representation of the semantics of an image, and have been shown to be useful for semantic-level interpretation and reasoning about a visual scene [13]. Extracting scene graphs from images provides a semantic representation that can later be used for reasoning, question answering [43], and image retrieval [15, 34].

Previous scene-graph prediction work used attention [33] or neural message passing [40]. [30] proposed predicting graphs directly from pixels in an end-to-end manner. NeuralMotif [46] captures global context using an RNN that sequentially reads the independent predictions for each entity and relation and then refines those predictions.

Referring Relationships. Several recent studies looked into the task of detecting an entity based on a referring expression [17, 20], while taking context into account. [28] described a model that has two parts: one for generating expressions that point to an entity in a discriminative fashion and a second for understanding these expressions and detecting the referred entity. [44] explored the role of context and visual comparison with other entities in referring expressions.

Modelling context was also the focus of [29], using a multi-instance-learning objective. Recently, [21] introduced an explicit iterative model that localizes the two entities in the referring relationship task, conditioned on one another using attention from one entity to another. In contrast to that work, we present an implicit model that uses latent scene context, resulting in new state-of-the-art results on three vision datasets that contain visual relationships.

8 Conclusion

This work is motivated by the assumption that accurate reasoning about images may require access to a detailed representation of the image. While scene graphs provide a natural structure for representing relational information, it is hard to train very dense SGs in a fully supervised manner, and for any given image, the resulting SGs may not be appropriate for downstream reasoning tasks. Here we advocate DSGs, an alternative representation that captures the information in SGs, is continuous, and can be trained jointly with downstream tasks. Our results, both qualitative (Fig. 4) and quantitative (Tables 1 and 2), suggest that DSGs effectively capture scene structure, and that this can be used for downstream tasks such as referring relationships.

One natural next step is to study such representations in additional downstream tasks that require integrating information across the image. Some examples are caption generation and visual question answering. DSGs can be particularly useful for VQA, since many questions are easily answerable by scene graphs (e.g., counting questions and questions about relations). Another important extension to DSGs would be a model that captures high-order interactions, as in a hyper-graph. Finally, it will be interesting to explore other approaches to training the DSG, and in particular finding ways for using unlabeled data for this task.