Images are not simply sets of objects: each image represents a web of interconnected relationships. These relationships between entities carry semantic meaning and help a viewer differentiate between instances of an entity. For example, in an image of a soccer match, there may be multiple persons present, but each participates in different relationships: one is kicking the ball, and the other is guarding the goal. In this paper, we formulate the task of utilizing these "referring relationships" to disambiguate between entities of the same category. We introduce an iterative model that localizes the two entities in the referring relationship, conditioned on one another. We formulate the cyclic condition between the entities in a relationship by modelling predicates that connect the entities as shifts in attention from one entity to another. We demonstrate that our model can not only outperform existing approaches on three datasets --- CLEVR, VRD and Visual Genome --- but also that it produces visually meaningful predicate shifts, as an instance of interpretable neural networks. Finally, we show that by modelling predicates as attention shifts, we can even localize entities in the absence of their category, allowing our model to find completely unseen categories.
Referring expressions in everyday discourse help identify and locate entities (we use the term “entities” for what is commonly referred to as “objects”, to differentiate from the term object in subject-predicate-object relationships) in our surroundings. For instance, we might point to the “person kicking the ball” to differentiate from the “person guarding the goal” (Figure 1). In both these examples, we disambiguate between the two persons by their respective relationships with other entities. While one person is kicking the ball, the other is guarding the goal. The eventual goal is to build computational models that can identify which entities others are referring to.
To enable such interactions, we introduce referring relationships: a task where, given a relationship, models must identify which entities in a scene the relationship refers to. Formally, the task takes as input an image along with a relationship of the form subject - predicate - object, and outputs localizations of both the subject and the object. For example, we can express the above examples as person - kicking - ball and person - guarding - goal (Figure 1). Previous work has attempted to disambiguate entities of the same category in the context of referring expression comprehension [28, 24, 41, 42, 11]. That task expects a natural language input, such as “a person guarding the goal”, resulting in evaluations that require both natural language and computer vision components. It can be challenging to pinpoint whether errors made by these models come from the language or the visual components. By interfacing with a structured relationship input, our task is a special case of referring expressions that alleviates the need to model language.
Referring relationships retain and refine the algorithmic challenges at the core of prior tasks. In the object localization literature, some entities such as zebra and person are highly discriminative and can be easily detected, while others such as glass and ball tend to be harder to localize. These difficulties arise due to, for example, small size and non-discriminative composition. This difference in difficulty translates over to the referring relationships task. To tackle this challenge, we use the intuition that detecting one entity becomes easier if we know where the other one is. In other words, we can find the ball conditioned on the person who is kicking it, and vice versa. We model this cyclic dependency by rolling out our model and iteratively passing messages between the subject and the object through an operator defined by the predicate. We describe this operator in more detail in Section 3.
However, modelling this predicate operator is not straightforward, which leads us to our second challenge. Traditionally, previous visual relationship papers have learned an appearance-based model for each predicate [20, 23, 26]. Unfortunately, the drastic appearance variance of predicates, depending on the entities involved, makes learning predicate appearance models challenging. For example, the appearance of the predicate carrying can vary significantly between the following two relationships: person - carrying - phone and truck - carrying - hay. Instead, inspired by the moving spotlight theory in psychology [18, 35], we bypass this challenge by using predicates as a visual attention shift operation from one entity to the other. While one shift operation learns to move attention from the subject to the object, an inverse predicate shift similarly moves attention from the object back to the subject. Over multiple iterations, we operationalize these asymmetric attention shifts between the subject and the object as different types of message operations for each predicate [37, 9].
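To make the idea of a predicate as an attention shift concrete, here is a minimal numpy sketch. The `shift_attention` function and the kernel values are our own illustrative stand-ins, not the paper's learned filters: a hand-coded kernel for left moves a subject's attention mass one cell to the right, toward where the object should be.

```python
import numpy as np

def shift_attention(att, kernel):
    """Cross-correlate a 2-D attention map with a shift kernel (zero padding)."""
    L = att.shape[0]
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(att, pad)
    out = np.zeros_like(att)
    for i in range(L):
        for j in range(L):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

# Attention concentrated on the subject at the center of a 5x5 grid.
att = np.zeros((5, 5))
att[2, 2] = 1.0

# A kernel that moves mass one cell to the right: correlation picks up the
# left neighbor, so the attention mass ends up shifted rightward.
kernel = np.zeros((3, 3))
kernel[1, 0] = 1.0

shifted = shift_attention(att, kernel)  # mass now at (2, 3)
```

A learned version of this operation would replace the hand-coded kernel with convolution filters trained per predicate, which is the role the shift modules play in the model described below.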
In summary, we introduce the task of referring relationships, whose structured relationship input allows us to evaluate how well we can unambiguously identify entities of the same category in an image. We evaluate our model (code available at https://github.com/StanfordVL/ReferringRelationships) on three vision datasets that contain visual relationships: CLEVR, VRD and Visual Genome. In each of these datasets, a substantial fraction of relationships refer to ambiguous entities, i.e. entities that have multiple instances of the same category. We extend our model to perform attention saccades using relationships belonging to a scene graph. Finally, we demonstrate that in the absence of the subject or the object, our model can still disambiguate between entities while also localizing entities from new categories that it has never seen before.
To properly situate the task of referring relationships, we explore the evolution of visual relationships as a representation. Next, we survey the inception of referring expression comprehension as a similar task, summarize how attention has been used in the deep learning literature, and review other technical approaches related to ours.
There is a long history of vision papers moving beyond simple object detection, modelling the context around the entities [27, 31] or even studying object co-occurrences [8, 19, 25] to improve classification and detection itself. Our referring relationships task is motivated by such papers. Unlike these models, we utilize a formal definition of context in the form of a visual relationship.
Pushing along this thread, visual relationships were initially limited to spatial relationships: above, below, inside and around. Relationships were then extended to include human interactions, such as holding and carrying. Extending the definition further, the task of visual relationship detection was introduced along with a dataset of spatial, comparative, action and verb predicates. More recently, relationships were formalized as part of an explicit formal representation for images called scene graphs [14, 17], along with a dataset of scene graphs called Visual Genome. These scene graphs encode the entities in a scene as nodes in a graph, connected together by directed edges representing their relative relationships. Scene graphs have been shown to improve a number of computer vision tasks, including semantic image retrieval, image captioning and object detection. Newer work has extended models for relationship detection to use co-occurrence statistics [26, 32, 37]
and has even formulated the problem in a reinforcement learning framework. These papers focus primarily on detecting visual relationships categorically: they output relationships given an input image. In contrast, we focus on the inverse problem of localizing the entities that take part in an input relationship. We disambiguate the entities in a query relationship from other entities of the same category in the image. Moreover, while all previous work has attempted to learn visual features of predicates, we propose that the visual appearances of predicates are too varied and can be more effectively learnt as an attention shift, conditioned on the entities in the relationship.
Such an inverse task of disambiguating between different regions in an image has been studied under the task of referring expression comprehension. This task uses an input language description to find the referred entities. The work has been motivated by human-robot interaction, where the robot must disambiguate which entities the human user is referring to. Models for this task have been extended to include global image contrasts, visual relationships and reward-based reinforcement systems that encourage the generation of unique expressions for different image regions. Unfortunately, all these models require the ability to process both natural language and visual constructs. This requirement makes it difficult to disentangle whether mistakes result from poor language modelling or poor visual understanding. In an effort to ameliorate these limitations, we propose the referring relationships task, simplifying referring expressions by replacing the language inputs with a structured relationship. We focus solely on the visual component of the model, avoiding confounding errors from language processing.
One key observation about predicates is their large variance in visual appearance. For example, consider these two relationships: person - carrying - phone and truck - carrying - hay. We use an insight from psychology [18, 35], specifically the moving spotlight theory, which suggests that visual attention can be modelled as a spotlight that can be conditioned on and directed towards specific targets. The use of attention has been explored to improve image captioning [38, 2] and even stacked to improve question answering [13, 39]. In comparison, we model two discriminative attention shifting operations for each unique predicate: one conditioned on the subject to localize the object, and an inverse predicate shift conditioned on the object to find the subject. Each predicate utilizes both the current estimate of the entities as well as image features to learn how to shift, allowing it to utilize both spatial and semantic features.
Our work also has similarities to knowledge bases, where predicates are often projections in a defined semantic space [3, 6, 22]. Such a method was recently used for visual relationship detection. While these methods have seen success in knowledge base completion tasks, they have only led to marginal gains for modelling visual relationships. Unlike these methods, we do not model predicates as a projection in semantic space but as a shift in attention conditioned on an entity in the relationship. Our method can be thought of as a special case of the deformable parts model with two deformable parts, one for each entity. Finally, our message passing algorithm can be thought of as a domain-specific, specialized version of the message passing in graph convolution approximation methods [9, 15].
Recall that our aim is to use the input referring relationship to disambiguate entities in an image by localizing the entities involved in the relationship. Formally, the input is an image with a referring relationship, S - P - O, which are the subject, predicate and object categories, respectively. The model is expected to localize both the subject and the object.
We begin by using a pre-trained convolutional neural network (CNN) to extract an L × L × C feature map from the image. That is, for each image, we extract a 3-dimensional tensor of shape L × L × C, where L is the spatial size of the feature map and C is the number of feature channels. Our goal is to decide whether each image region belongs to the subject, the object, or neither. We can model this problem by representing the image with two binary random variables per region: for region i, S_i = 1 implies that the subject occupies the region and O_i = 1 implies that the object occupies that region, subject to some hyperparameter threshold. We now define a graph G = (V, E), where V are the nodes of the graph represented by the image regions and E contains an edge from every S_i to every O_j. Given the image and relationship, we want to assign each S_i and O_j a value in {0, 1}.
This optimization problem can be reduced to inference on a densely connected graph, which can be very expensive. As shown in previous work [44, 16], dense graph inference can be approximated by mean field in Conditional Random Fields (CRFs). Such papers allow fully differentiable inference by assuming weighted Gaussians as pairwise potentials. To achieve greater flexibility in a more principled training framework, we design a general model where the message passing during inference is a series of learnt convolutions. More specifically, we design our model with two types of modules: attention and predicate shift modules. While attention modules attempt to locate a specific category in an image, the predicate shift modules learn to move attention from one entity to another.
Before we specify our attention and shift operators, let’s revisit the challenges in referring relationships to motivate our design decisions. The two challenges are (1) the difference in difficulty in object detection and (2) the drastic appearance variance of predicates. First, the difference in difficulty arises because some objects like zebra and person are highly discriminative and can be easily detected while others like glass and ball tend to be harder to localize. We can overcome this problem by conditioning the localization of one entity on the other. If we know where the person is, we should be able to estimate the location of the ball that they are kicking.
Second, predicates tend to vary in appearance depending on the objects involved in the relationship. To deal with this wide appearance variance, we move away from previous work that attempted to learn appearance features of predicates and instead treat predicates as a mechanism for shifting attention from one object to another. Relationships like above should learn to focus attention down from the subject when locating the object, and the predicate left of should focus attention to the right of the subject. Inversely, once we locate the object, the model should use left of to focus attention to the left to confirm its initial estimate of the subject. Note that not all predicates are spatial, so we also ensure that we can model their visual appearances by conditioning the shifts on the image features as well.
Attention modules. With these design goals in mind, we formulate the attention module as an initial estimate of the subject and object localizations by approximating the maximizers S*, O* with the soft attentions A_S^0, A_O^0:

    A_S^0 = att(I, ReLU(W_e S)),    A_O^0 = att(I, ReLU(W_e O))

where W_e embeds the entity into a C-dimensional semantic space and ReLU is the Rectified Linear Unit operator. A_S^0 and A_O^0 denote the initial attention over the subject and object, which are not conditioned on the predicate at all and only use the entities.
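A minimal sketch of how such an attention module could be implemented, assuming sigmoid-normalized inner products between a ReLU'd entity embedding and the image features. All names, shapes and the random weights below are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, num_categories = 4, 8, 10

I = rng.standard_normal((L, L, C))              # CNN feature map, shape (L, L, C)
W_e = rng.standard_normal((num_categories, C))  # entity embedding matrix

def attend(feats, category):
    """Initial attention over image regions for one entity category."""
    emb = np.maximum(W_e[category], 0.0)  # ReLU on the embedding
    scores = feats @ emb                  # (L, L) relevance of each region
    return 1.0 / (1.0 + np.exp(-scores))  # squash to (0, 1) attention

A_S0 = attend(I, category=3)  # initial subject attention, shape (L, L)
```

In the real model these embeddings and the image features are trained jointly; the sketch only shows how an attention map of the same spatial size as the feature map falls out of the computation.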
Predicate shift modules. Inspired by the message passing protocol in CRFs, we design a more general message passing function to transfer information between the two entities. Each message is passed from the subject’s estimate to localize the object, and vice versa. In practice, we want the message passed from the subject to the object to differ from the one passed from the object back to the subject. So, we learn two asymmetric attention shifts: one that shifts the attention from the subject to its estimate of where it thinks the object is, and another that does the inverse from the object to the subject. We denote these shift operations as the predicate shift and the inverse predicate shift, respectively, and define them as convolutions applied in series to the initial estimated assignments:
where the operator is applied n times in series, each application parametrized by learned convolution filters for the inverse predicate and the predicate operations, respectively. Each convolution uses kernels of size k × k with multiple channels, and we set the number of channels to 1 for the last convolution to ensure that the shifted attention maps have spatial dimension L × L. While we do not enforce the two shift operators to be inverses of one another, for most predicates we empirically find that the inverse predicate shift in fact learns the inverse attention shift of the predicate shift. Note that we do not provide any supervision to these shifts; the model is tasked with learning them in order to improve its entity localizations. The outputs of the two predicate shift operators are new attention masks over where our model expects to find the object, conditioned on its initial estimate of the subject, and vice versa from the object to the subject.
Each predicate learns its own set of shift and inverse shift functions. By allowing multiple channels for each set of kernels, our model can formulate shifts as a mixture. For example, carrying might want to focus on the top of the object when the relationship is person - carrying - phone, while focusing towards the bottom when the relationship is person - carrying - bag.
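The shift operator described above can be sketched as a stack of convolutions applied in series to an attention map, with the last convolution reduced to a single channel. The filter values below are random stand-ins for the learned weights, and the kernel sizes and channel counts are illustrative:

```python
import numpy as np

def conv2d(x, kernel):
    """Naive 2-D convolution: x is (L, L, C_in), kernel is (k, k, C_in, C_out),
    zero padding, stride 1."""
    L, _, _ = x.shape
    k, _, _, C_out = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((L, L, C_out))
    for i in range(L):
        for j in range(L):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

def predicate_shift(att, kernels):
    """Apply n convolutions in series to an (L, L) attention map.
    The last kernel has a single output channel so the result is (L, L) again."""
    x = att[..., None]
    for K in kernels:
        x = np.maximum(conv2d(x, K), 0.0)  # ReLU between convolutions
    return x[..., 0]

rng = np.random.default_rng(1)
kernels = [rng.standard_normal((3, 3, 1, 4)) * 0.1,   # expand to 4 channels
           rng.standard_normal((3, 3, 4, 1)) * 0.1]   # collapse back to 1
att = np.zeros((5, 5))
att[2, 2] = 1.0
shifted = predicate_shift(att, kernels)  # new (5, 5) attention estimate
```

Training would fit one such kernel stack per predicate (plus a separate stack for the inverse shift), with gradients flowing from the localization loss.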
Since we want every image region to pass a message to all other regions, we enforce that n(k − 1)/2 ≥ L, i.e. we need a minimum number of convolutions in series. We arrive at this restriction because the maximum spatial distance that a message needs to travel is L and the furthest image region it can send a message to in each iteration is (k − 1)/2, where L is the image feature size and k is the kernel size of each predicate shift convolution.
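Under this constraint, the minimum number of stacked shift convolutions follows directly from L and k; a small helper makes the arithmetic explicit (the function name is ours):

```python
import math

def min_shift_convs(L, k):
    """Smallest n with n * (k - 1) / 2 >= L: each k x k convolution can
    propagate attention by (k - 1) / 2 regions per application."""
    return math.ceil(2 * L / (k - 1))

n = min_shift_convs(14, 5)  # a 14x14 feature map with 5x5 kernels -> n == 7
```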
Running iterative inference. Once we have these estimates, we can modulate our image features with the shifted attention using an element-wise multiplication across the channels in the feature map. We can then pass the modulated features back to the subject and object attention modules to update their locations:
We can continuously update these locations, conditioned on one another. This amounts to running maximum a posteriori inference on one entity while using the other entity’s previous location. We finally output the subject and object attention maps after T iterations, where T is a hyper-parameter that determines the number of iterations for which we run inference.
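The full inference loop alternates between attending and shifting. The sketch below captures only the control flow, with simplified stand-ins: a fixed one-cell `np.roll` plays the role of the learned predicate shift, and the embeddings are random, so it illustrates the iteration structure rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
L, C = 4, 6
I = rng.standard_normal((L, L, C))   # image features
emb_s = np.abs(rng.standard_normal(C))  # subject embedding (stand-in)
emb_o = np.abs(rng.standard_normal(C))  # object embedding (stand-in)

def attend(feats, emb):
    scores = feats @ emb
    return 1.0 / (1.0 + np.exp(-scores))  # (L, L) attention in (0, 1)

def shift(att):
    return np.roll(att, 1, axis=1)   # stand-in predicate shift: one cell right

def inv_shift(att):
    return np.roll(att, -1, axis=1)  # stand-in inverse shift: one cell left

# Iteration 0: initial estimates from the entities alone.
A_S, A_O = attend(I, emb_s), attend(I, emb_o)

T = 3  # number of inference iterations
for _ in range(T):
    # Modulate image features with the shifted attention, then re-attend.
    A_O = attend(I * shift(A_S)[..., None], emb_o)
    A_S = attend(I * inv_shift(A_O)[..., None], emb_s)
```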
We extract image features from the last activation layer of the final convolutional block of an ImageNet pre-trained ResNet50 and finetune the features. We find that our model performs best with predicate convolution filters that use multiple channels.
Table 1: Mean IoU (higher is better) and KL divergence (lower is better) for subject (S) and object (O) localization on CLEVR, VRD and Visual Genome.

| Model | IoU CLEVR (S/O) | IoU VRD (S/O) | IoU Visual Genome (S/O) | KL CLEVR (S/O) | KL VRD (S/O) | KL Visual Genome (S/O) |
| --- | --- | --- | --- | --- | --- | --- |
| Spatial shift | 0.740 / 0.740 | 0.320 / 0.371 | 0.399 / 0.469 | 0.643 / 0.643 | 2.612 / 2.318 | 1.512 / 1.293 |
| VRD [23, 11] | 0.734 / 0.732 | 0.345 / 0.387 | 0.417 / 0.480 | 1.024 / 1.014 | 2.492 / 2.171 | 1.483 / 1.255 |
We start our experiments by evaluating our model’s performance on referring relationships across three datasets, where each dataset provides a unique set of characteristics that complement our experiments. Next, we evaluate how to improve our model in the absence of one of the entities in the input referring relationship. Finally, we conclude by demonstrating how our model can be modularized and used to perform attention saccades through a scene graph.
CLEVR. CLEVR is a synthetic dataset generated from scene graphs, where the relationships between objects are limited to spatial predicates (left, right, front, behind) and distinct entity categories. With a large number of relationships, a sizeable fraction of which are ambiguous, along with the ease of localizing its object categories, this dataset allows us to explicitly test the effects of our predicate attention shifts without confounding errors from poor image features or noise in real world datasets.
VRD. Visual relationship detection (VRD) is the most widely benchmarked dataset for relationship detection in real world images. It consists of a modest number of object and predicate categories annotated across real images, a portion of whose relationships are ambiguous. With only a few examples per object and predicate category, this dataset allows us to evaluate how our model performs when starved for data.
Visual Genome. Visual Genome is the largest publicly available dataset for visual relationships in real images. It contains a large set of images with many relationship instances. We use a version that focuses on the most common object and predicate categories. Our experiments on Visual Genome represent a large scale evaluation of our method, in which a sizeable fraction of relationships refer to ambiguous entities.
Evaluation Metrics. Recall that the output of our model is a localization of the subject and the object of the referring relationship. To evaluate performance, we report the Mean Intersection over Union (IoU), a common metric used in localizing salient parts of an image [4, 5]. This metric measures the average intersection over union between the predicted image regions and the ground truth bounding boxes. Next, we report the KL-divergence, which measures the dissimilarity between the predicted and ground truth saliency maps and heavily penalizes false positives.
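Both metrics are straightforward to compute over binarized or normalized heatmaps. Below is a small numpy sketch; the implementation and the `eps` smoothing constant are our own assumptions, not the paper's evaluation code:

```python
import numpy as np

def mean_iou(pred, gt):
    """IoU between two binary masks (True = inside the region)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def kl_divergence(gt, pred, eps=1e-8):
    """KL(gt || pred) between heatmaps normalized to sum to 1."""
    p = gt / (gt.sum() + eps)
    q = pred / (pred.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

gt = np.zeros((4, 4), bool)
gt[1:3, 1:3] = True          # ground truth box: 4 cells
pred = np.zeros((4, 4), bool)
pred[1:3, 1:4] = True        # prediction: 6 cells, 4 overlapping
iou = mean_iou(pred, gt)     # intersection 4, union 6 -> 2/3
```

Note the asymmetry of KL: mass predicted where the ground truth has none is what drives the divergence up, which is why it penalizes false positives heavily.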
Baseline models. We create three competitive baseline models inspired by related work in entity co-occurrence, spatial attention shifts and visual relationship detection. The first model tests how much we can leverage the entities’ co-occurrence alone, without using the predicate. This model simply embeds the subject and the object and combines them to collectively attend over the image features. The next baseline embeds the entities along with the predicate using a series of dense layers, similar to the vision component of the relationship embeddings used in visual relationship detection (VRD) [23, 11]. This model has access to the entire relationship when finding the two entities. Finally, the third baseline replaces our learnt predicate shifts with a spatial shift that we statistically learn for each predicate in the dataset (see supplementary material for details). This final model tests whether our model utilizes semantic information from images and not just the spatial information from the entities to make predictions.
Quantitative results. Across all the datasets, we find that the co-occurrence model is unable to disambiguate between instances of the same category and only performs well when there is a single instance of that category in an image. The spatial shift model does better than the other baselines on CLEVR, where the predicates are spatial, and worse on the real world datasets, implying that it is insufficient to model predicates only as spatial shifts. Surprisingly, when evaluating on the CLEVR dataset, we find that the VRD model does not properly utilize the predicate, leading to only marginal gains over the co-occurrence model. In comparison, we find that our SSAS variants perform better across all metrics, with a substantial Mean IoU gain on CLEVR. This gain, however, is smaller on Visual Genome and VRD, as these datasets are noisy and incomplete, penalizing our model for making predictions that are not annotated in the datasets. KL divergence, which only penalizes false predictions, highlights that our models are more precise than our baselines. Across the different ablations of SSAS, we notice that having more iterations is better, but the performance saturates after a few iterations because the predicate shifts and the inverse predicate shifts learn near inverse operations of one another.
Interpreting our results. We can interpret the predicate shifts by synthetically initializing the subject to be at the center of an image, as shown in Figure 3(a). When applying the left predicate shift, we see that the model has learnt to focus its attention to the right, expecting to find the object to the right of the subject. Similarly, the inverse predicate shift learns to do nearly the opposite by focusing attention in the other direction. When visualizing these shifts next to the dataset examples in Visual Genome, we see that the shifts represent the biases that exist in the dataset (Figure 3(b)). For example, since most entities that can be ridden are below the subject, the shifts learn to focus attention down to find the object and up to find the subject. We also find that our model learns to encode dataset bias in these shifts. Since most training images for hit show people playing tennis or baseball facing left, our model captures this bias by learning that hit should focus attention to the bottom left to find the entity being hit.
Figure 4 shows numerous examples of how our model shifts attention over multiple iterations. Generally, across all our test cases, the subject and object attention modules learn to use the image features to localize all instances on the initial iteration. For example, in Figure 4(a), all the regions that contain a person are initially activated. But after the predicate and the inverse predicate shifts, we see that the model learns to move the attention in opposite directions for the predicate left. In the second iteration, both people are uniquely localized in the image. Figure 4(b) clearly shows that we can easily locate all instances of purple metal cylinders in the image, since it is easy to detect entities in CLEVR. Our model learns to identify which purple metal cylinder we are actually referring to on successive iterations while suppressing the other instance.
In Figure 4(c), even though both the subject and object have multiple instances of person and cup, we can disambiguate which person is actually holding the cup. For the same image in Figure 4(d), our model is able to distinguish the cup being held in the previous referring relationship from the one that is on top of the table. In cases where a referring relationship is not unique, like the example in Figure 4(e), we manage to find all instances that satisfy the relationship; here, we return both persons riding skateboards. Having learnt from the dataset that most relationships with stand next to annotate the subject to the left of the object, our model emulates this behaviour in Figure 4(f). However, our model does make its fair share of mistakes: for example, in Figure 4(g), it finds both persons and is not able to distinguish which one is wearing the skis.
Table 2: Mean IoU for localizing the subject (S) and object (O) on VRD when parts of the input referring relationship are masked out.

| Model | No subject (S) | No object (O) | Only predicate (S) | Only predicate (O) |
| --- | --- | --- | --- | --- |
| SSAS (iter 1) | 0.331 | 0.359 | 0.332 | 0.361 |
| SSAS (iter 2) | 0.333 | 0.360 | 0.334 | 0.361 |
| SSAS (iter 3) | 0.335 | 0.363 | 0.334 | 0.365 |
Now that we have evaluated our model, one natural question to ask is how important it is for the model to receive both entities of the relationship as input. Can it localize the person from Figure 1 if we only use ___ - kicking - ball as input? Or can we localize both the subject and the object with only ___ - kicking - ___? We are also interested in taking this task a step further and studying whether we can localize categories that we have never seen before. Previous work has shown that we can localize seen categories in novel relationship combinations, but we want to know if it is possible to localize unseen categories.
We remove all instances of categories like pants, hydrant, etc. that are not in ImageNet (our feature extractor was pre-trained on ImageNet) from our training set and attempt to localize these novel categories using their relationships. We do not make any changes to our model, but alter the training script to randomly mask out the subject, the object, or both in the referring relationships during each iteration, with a fixed drop rate. The model learns to attend over general object categories when the entities are masked out. We find that we can in fact localize these missing entities, even when they are from unseen categories. We report results for this experiment on the VRD dataset in Table 2.
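The masking scheme described above can be sketched as a small preprocessing step on the input triples. The `UNKNOWN` token id, the branch structure and the 0.3 rate below are illustrative assumptions; the paper's actual drop rate is not reproduced here:

```python
import numpy as np

UNKNOWN = 0  # generic "masked entity" token (illustrative)

def mask_relationship(subj, pred, obj, drop_rate, rng):
    """Randomly replace the subject, the object, or both with UNKNOWN.
    The predicate is always kept."""
    r = rng.random()
    if r < drop_rate:                  # mask the subject
        subj = UNKNOWN
    elif r < 2 * drop_rate:            # mask the object
        obj = UNKNOWN
    elif r < 3 * drop_rate:            # mask both entities
        subj = obj = UNKNOWN
    return subj, pred, obj

rng = np.random.default_rng(3)
# apply to a stream of (subject=5, predicate=2, object=7) training triples
masked = [mask_relationship(5, 2, 7, drop_rate=0.3, rng=rng) for _ in range(1000)]
```

At test time the model is then queried with the same UNKNOWN token in place of the missing entity, which is what lets it localize categories it was never named.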
A ramification of our model design is its modularity: the attention and shift modules expect inputs and produce outputs of the same shape as the image features. We can decompose these modules and stack them like Lego blocks, allowing us to perform more complicated tasks. One particularly interesting extension to referring relationships is attention saccades. Instead of using a single relationship as input, we can extend our model to take an entire scene graph as input. Figure 5 demonstrates how we can iterate between the attention and shift modules to traverse a scene graph. We can start from the phone and localize the jacket worn by the “woman on the right of the man using the phone”. A scene graph traversal can be evaluated by decomposing the graph into a series of relationships. We do not quantitatively evaluate these saccades here, as their evaluation is already captured by the referring relationships in the graph.
We introduced the task of referring relationships, where our model utilizes visual relationships to disambiguate between instances of the same category. Our model learns to iteratively use predicates as an attention shift between the two entities in a relationship. It updates its belief of where the subject and object are by conditioning its predictions on the previous location estimate of the object and subject, respectively. We show improvements on the CLEVR, VRD and Visual Genome datasets. We also demonstrate that our model produces interpretable predicate shifts, allowing us to verify that the model is in fact learning to shift attention. We even showcase how our model can be used to localize completely unseen categories by relying on partial referring relationships, and how it can be extended to perform attention saccades on scene graphs. Improvements in referring relationships could pave the way for vision algorithms to detect unseen entities and to grow their understanding of the visual world.
Acknowledgements. Toyota Research Center (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We thank John Emmons, Justin Johnson and Yuke Zhu for their helpful comments.
In the supplementary material, we include more detailed results of our task for every entity and predicate category, allowing us to diagnose which entities or predicates are difficult to model. We also include the learnt predicate and inverse predicate shifts for all the predicates we modelled in VRD, CLEVR and Visual Genome. Furthermore, we explain our baseline models in more detail here.
Given that the closest task to referring relationships is referring expression comprehension, we draw inspiration from this literature when designing our baselines. A frequent approach used by most models for this task involves semantically mapping language expressions to their corresponding image regions [28, 24, 41]. Our first two baselines (co-occurrence and VRD) draw inspiration from this line of work: they map relationships to a semantic feature space and place them close to the image regions to which they refer, using our attention module.
The difference between the two baseline models lies in how we embed the relationships into that semantic space. In the case of co-occurrence, we are only interested in studying how well we can model a relationship without the predicate, relying simply on co-occurrence statistics. So, we first embed the subject and the object, concatenate their representations and pass them through a dense layer followed by a ReLU non-linearity to allow the two embeddings to interact. For the VRD baseline, we embed the entire relationship similar to prior work by embedding all three components of the relationship, concatenating their representations and passing them through a dense and non-linear layer.
Unlike our model, which attends over the subject and object in succession, these models are jointly aware of the entire relationship, or at least of the other entity, when attending over the image features. Moreover, embedding the predicate and attending over the image with this embedding forces these baselines to model predicates as purely visual. But predicates such as above or below are not visually salient and can only be modelled as a relative shift from one entity to another. We show through our experiments that such baselines neither perform as well as our model nor are interpretable.
Instead of learning the attention shifts for each predicate, this baseline assumes (incorrectly) that all predicates are simply spatial shifts and models each predicate as a shift function. We learn the shift statistically from the relative locations of the two entities of the relationship. We visualize these statistically calculated shifts in Figures 8, 10 and 12. We normalize the shifts to visualize the heatmaps; they do not show the actual magnitude by which each predicate shifts attention, only the direction of the shift. As expected, predicates like left push attention to the right, etc. This baseline uses our attention modules to find the subject and object, and uses the precalculated shifts to move attention around. We only need to train the attention module, which is equivalent to training our SSAS model with zero iterations. During evaluation, we use these statistical spatial shifts to move attention.
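Fitting such a statistical shift is straightforward; the sketch below averages the displacement between subject and object box centres over training examples and applies it at test time. Boxes are `(x0, y0, x1, y1)` tuples, and the data in the test is made up for illustration.

```python
# Statistical spatial-shift baseline: one fixed (dx, dy) shift per predicate,
# estimated from the relative locations of the two entities in training data.

def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def fit_shift(examples):
    """Mean (dx, dy) from subject centre to object centre for one predicate."""
    dxs, dys = [], []
    for subj_box, obj_box in examples:
        (sx, sy), (ox, oy) = centre(subj_box), centre(obj_box)
        dxs.append(ox - sx)
        dys.append(oy - sy)
    n = float(len(examples))
    return (sum(dxs) / n, sum(dys) / n)

def apply_shift(point, shift):
    """Move an attended location by the precalculated predicate shift."""
    (x, y), (dx, dy) = point, shift
    return (x + dx, y + dy)
```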
This baseline is useful in two ways. First, it demonstrates that it is important to model predicates as both spatial and semantic. Second, it allows us to compare the learnt predicate shifts with these calculated ones to verify that our SSAS models are in fact learning spatial shifts as well.
While above and below are purely spatial predicates, others like hit or sleep on are both spatial and semantic. hit usually refers to entities around the subject that are usually balls. Similarly, sleep on usually refers to something below the subject, typically a bed or a couch. We show the learnt predicate shifts of all the predicates in the three datasets in Figures 7, 9 and 11.
As expected, most spatial relationships are interpretable. In Figure 7, above moves attention down while its inverse moves it up. hit focuses on the bottom right, reflecting the dataset bias of right-handed people hitting tennis balls and baseballs. In Figure 11, wearing shifts attention all over the body of the subject, focusing mainly on shirts, pants and glasses. by splits the attention both to the left and to the right to find what the subject is next to. Some predicates, like attached to, are harder to interpret as they depend on both semantic and spatial shifts. While our model uses the image features to learn these shifts, our current spatial shift visualization does not produce an interpretable predicate shift for them.
One of the benefits of referring relationships is its structured representation of the visual world, allowing us to study which entities and predicates are hard to model. In this section we report the Mean IoU of our model on all the predicate categories in Tables 3 and 5. Note that we do not report results for CLEVR here since all its spatial predicates are equally represented in the dataset and perform equally across categories.
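The Mean IoU numbers reported in these tables can be computed as below; this is a generic intersection-over-union sketch over `(x0, y0, x1, y1)` boxes, not the paper's exact evaluation code.

```python
# Intersection over Union for axis-aligned boxes, and its mean over a set
# of (predicted, ground-truth) pairs.

def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pairs):
    """Average IoU over (predicted, ground-truth) box pairs."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)
```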
Across most predicates we find that the object is much harder to localize than the subject. This occurs because most objects tend to be smaller entities, which are better localized by first attending over the subject. We also see that size is an important factor in detection: predicates like carry and use usually have a larger subject and a smaller object, and we find that the IoU for the subject is much higher than that of the object. We also see that when entities are partially occluded, for example the subject in subject - drive - object, the object IoU is much higher than that of the occluded subject.
We run a similar analysis of the performance of our model across all the entity categories and report Mean IoU results in Tables 4 and 6. Note that we do not report results for CLEVR here since all the entities perform equally across all categories.
We find that the Mean IoU for all entities in Visual Genome are higher than the ones in VRD, implying that more data for each of these categories helps the model learn to attend over the right image regions. In Figure 6, we find that with the predicate shifts, we can detect smaller objects, like face, ear, bowl, eye, a lot better. Some entities like shelves and light don’t perform well on the dataset because not all the shelves or light sources are annotated in the dataset, causing the model’s correct predictions to be penalized. Surprisingly, the model has a hard time finding bags, perhaps because it learns that bags are often found being worn or carried by people in the training set but the test set contains bags that are on the ground or resting against other entities.
| Predicate | Subj. IoU | Obj. IoU | Predicate | Subj. IoU | Obj. IoU | Predicate | Subj. IoU | Obj. IoU |
|---|---|---|---|---|---|---|---|---|
| next to | 0.3338 | 0.3867 | outside of | - | 0.7778 | sit next to | 0.3158 | 0.3152 |
| stand next to | 0.4429 | 0.4436 | park next | 0.4012 | 0.5426 | sleep on | 0.3543 | 0.5429 |
| sit behind | 0.5854 | 0.9111 | park behind | 0.8545 | 0.5050 | in the front of | 0.3644 | 0.4009 |
| under | 0.4639 | 0.5188 | stand under | 0.2304 | 0.3622 | sit under | 0.2716 | 0.3158 |
| with | 0.3522 | 0.2823 | on the top of | 0.2896 | 0.4416 | on the left of | 0.2290 | 0.3272 |
| on the right of | 0.2864 | 0.3338 | sit on | 0.4281 | 0.4271 | ride | 0.4513 | 0.4936 |
| drive on | 0.7723 | 0.8269 | taller than | 0.4431 | 0.4423 | eat | 0.4726 | - |
| park on | 0.4639 | 0.7347 | lying on | 0.3457 | 0.6335 | pull | 0.4737 | 0.3362 |
| wearing a | 0.5208 | 0.3946 | made of | 0.4430 | 0.3389 | on front of | 0.2215 | 0.6592 |
| attached to | 0.2627 | 0.4524 | at | 0.4473 | 0.5085 | on a | 0.3471 | 0.4978 |
| of a | 0.2968 | 0.5857 | hanging on | 0.3166 | 0.4830 | near | 0.3931 | 0.4935 |
| in a | 0.3580 | 0.4629 | has | 0.6183 | 0.3341 | parked on | 0.3851 | 0.5559 |
| from | 0.2940 | 0.5188 | has a | 0.5841 | 0.3016 | standing on | 0.4715 | 0.6338 |
| on side of | 0.2453 | 0.5505 | in | 0.3574 | 0.5320 | wearing | 0.4466 | 0.1613 |
| have | 0.5750 | 0.2201 | are on | 0.3510 | 0.6001 | are in | 0.4185 | 0.6917 |
| in front of | 0.3963 | 0.5210 | looking at | 0.4503 | 0.4787 | belonging to | 0.3250 | 0.6243 |
| on top of | 0.3803 | 0.5735 | holds | 0.5194 | 0.3834 | inside of | 0.2398 | 0.3430 |
| along | 0.3647 | 0.5030 | hanging from | 0.2508 | 0.2905 | standing in | 0.4748 | 0.6173 |
The CLEVR dataset is annotated with objects in 3D space . To use the dataset in the same manner as VRD  and VisualGenome , we converted all the 3D entity locations into 2D bounding boxes, with respect to the viewing perspective of every image. We will release the conversion code as well as the bounding box annotations that we added to CLEVR. Figure 6 showcases an example image annotated with our bounding boxes.
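A minimal sketch of this 3D-to-2D conversion is given below, assuming a simple pinhole camera with known focal length and a camera-frame axis-aligned cube per entity; the real CLEVR scenes are rendered in Blender, whose camera parameters make the actual conversion more involved than this illustration.

```python
# Project the 8 corners of an axis-aligned 3D cube through a pinhole camera
# and take the min/max of the projected points to form a 2D bounding box.

def project(point, focal=1.0):
    """Perspective-project a camera-frame 3D point onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def box_2d(centre, half, focal=1.0):
    """2D bounding box (x0, y0, x1, y1) of a cube with the given half-size."""
    cx, cy, cz = centre
    corners = [(cx + sx * half, cy + sy * half, cz + sz * half)
               for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    pts = [project(c, focal) for c in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))
```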
Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.
International Conference on Machine Learning, pages 2048–2057, 2015.
Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.