Visual Relationships as Functions: Enabling Few-Shot Scene Graph Prediction

06/12/2019 ∙ by Apoorva Dornadula, et al. ∙ Stanford University

Scene graph prediction, which classifies the set of objects and predicates in a visual scene, requires substantial training data. The long-tailed distribution of relationships can be an obstacle for such approaches, however, as they can only be trained on the small set of predicates that carry sufficient labels. We introduce the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates. First, we introduce a new model of predicates as functions that operate on object features or image locations. Next, we define a scene graph model where these functions are trained as message passing protocols within a new graph convolution framework. We train the framework with a frequently occurring set of predicates and show that our approach outperforms those that use the same amount of supervision by 1.78 at recall@50 and performs on par with other scene graph models. We then extract object representations generated by the trained predicate functions to train few-shot predicate classifiers on rare predicates with as few as 1 labeled example. When compared to strong baselines like transfer learning from existing state-of-the-art representations, we show improved 5-shot performance by 4.16 recall@1. Finally, we show that our predicate functions generate interpretable visualizations, enabling the first interpretable scene graph model.




1 Introduction

Scene graph prediction takes as input an image of a visual scene, and returns as output a set of relationships denoted as subject - predicate - object, such as woman - drinking - coffee and coffee - on - table. The goal is for these models to classify a large number of relationships for each image. However, due to the complexity of the task and the uneven distribution of relationship instances in the world and in training data, existing scene graph models perform well only on the most frequent relationships (predicates). These existing models can be broadly divided into two approaches. The first approach detects the objects and then recognizes their pairwise relationships (dai2017detecting, ; liao2017natural, ; lu2016visual, ; yu2017visual, ). The second approach jointly infers the objects and their relationships li2017vip ; li2017scene ; xu2017scene based on object proposals. Both approaches treat relationship prediction as a multiclass predicate classification problem, given two object features. Such a formulation produces reasonable results as objects are a good indicator of relationships zellers2017neural . However, since the resulting object representations are utilized for both object and predicate classification, they confound the information required for both tasks. The representations are therefore not generalizable and cannot be used to train the vast majority of less frequently occurring predicates.

We present a new scene graph model that formulates predicates as functions, resulting in a scene graph model whose object representations can be used for few-shot predicate prediction. Instead of using the object representations to predict predicates, we treat predicates as two individual functions: a forward function that transforms the subject representation into the object, and an inverse function that transforms the object representation back into the subject. We further introduce a new graph convolution framework that uses these functions as localized message passing protocols between object nodes kipf2016semi . To further ensure that the object representations are disentangled from encoding specific information about a predicate, we divide each forward and inverse function into two components: a spatial component that transforms attention over the image space krishna2018referring and a semantic component that operates over the object features zhang2017visual . Within each graph convolution step, each pair of object representations scores the functions by checking which of them agree with the difference between their representations. These scores are then used to weight the transformations performed by the functions and to update the object representations. After multiple iterations, the object representations are classified into object categories, and the function weights that remain above a threshold result in a detected relationship.

By treating predicates as functions between object representations, our model is able to learn a meaningful embedding space that can be used for transfer learning of new few-shot predicate categories. For example, the forward function for riding learns to move the spatial attention to look below the subject to find the object and to move to a semantic location where rideable objects like car, skateboard, and bike can be found. We use the object representations generated by these functions to train few-shot predicate classifiers such as driving with as few as 1 labeled example.

Through our experiments on Visual Genome krishna2017visual , a dataset containing visual relationship data, we show that the object representations generated by the predicate functions result in meaningful features that can be used to enable few-shot scene graph prediction, exceeding existing transfer learning approaches by 4.16 at recall@1 with 5 labelled examples. We further justify our design decisions by demonstrating that our scene graph model performs on par with existing state-of-the-art models and even outperforms models that, like ours, do not utilize external knowledge bases gu2019scene , linguistic priors lu2016visual ; zellers2017neural , or complicated pre- and post-processing heuristics zellers2017neural ; chen2019scene . We run ablations where we remove the semantic or spatial components of our functions and demonstrate that both components lead to increased performance but the semantic component is responsible for most of the performance. Finally, since our predicates are transformation functions, we can visualize them individually, enabling the first interpretable scene graph model.

2 Related work

Scene graphs were introduced as a formal representation for visual information (johnson2015image, ; krishna2017visual, ) in a form widely used in knowledge bases (guodong2005exploring, ; culotta2004dependency, ; zhou2007tree, ). Each scene graph encodes objects as nodes connected together by pairwise relationships as edges. Scene graphs have led to many state-of-the-art models in image captioning (anderson2016spice, ), image retrieval (johnson2015image, ; schuster2015generating, ), visual question answering (johnson2017inferring, ), relationship modeling (krishna2018referring, ), and image generation (johnson2018image, ). Given its versatile utility, the task of scene graph prediction has resulted in a series of publications (krishna2017visual, ; dai2017detecting, ; liang2017deep, ; li2017vip, ; li2017scene, ; newell2017pixels, ; xu2017scene, ; zellers2017neural, ; yang2018graph, ; herzig2018mapping, ) that have explored reinforcement learning (liang2017deep, ), structured prediction (krahenbuhl2011efficient, ; desai2011discriminative, ; tu2010auto, ), utilizing object attributes (farhadi2009describing, ; parikh2011relative, ), sequential prediction (newell2017pixels, ), and graph-based (xu2017scene, ; li2018factorizable, ; yang2018graph, ) approaches. However, all of these approaches have classified predicates using object features, confounding the object features with predicate information and limiting their utility when used to train new few-shot predicate categories.

Predicates and relationships. The strategy of decomposing relationships into their corresponding objects and predicates has been recognized in other works (li2018factorizable, ; yang2018graph, ) but we generalize existing methods by treating predicates as functions, implemented as general neural network modules. Recent work on referring relationships showed that predicates can be learned as spatial transformations in visual attention (krishna2018referring, ). We extend this idea to formulate predicates as message passing semantic and spatial functions in a graph convolution framework. This framework generalizes existing work (li2018factorizable, ; yang2018graph, ) where relationships are usually treated as latent representations instead of functions. It also generalizes papers that have restricted these functions to linear transformations (bordes2013translating, ; zhang2017visual, ).

Graph convolutions. Modeling graphical data has historically been challenging, especially when dealing with large amounts of data (weston2012deep, ; belkin2006manifold, ; zhou2004learning, ). Traditional methods have relied on Laplacian regularization through label propagation (zhou2004learning, ), manifold regularization (belkin2006manifold, ), or learning embeddings (weston2012deep, ). Recently, operators on local neighborhoods of nodes have become popular with their ability to scale to larger amounts of data and parallelizable computation (grover2016node2vec, ; perozzi2014deepwalk, ). Inspired by these Laplacian-based, local operations, graph convolutions (kipf2016semi, ) have become the de facto choice when dealing with graphical data (kipf2016semi, ; scarselli2009graph, ; li2015gated, ; henaff2015deep, ; duvenaud2015convolutional, ; niepert2016learning, ). Graph convolutions have recently been combined with RCNN (girshick2015fast, ) to perform scene graph detection (yang2018graph, ; johnson2018image, ). Unlike most graph convolution methods, which assume a known graph structure, our framework doesn’t make any prior assumptions to limit the types of relationships between any two object nodes, i.e. we don’t use relationship proposals to limit the possible edges. Instead, we learn to score the predicate functions between the nodes, strengthening the correct relationships and weakening the incorrect ones over multiple iterations.

Few-shot prediction. While graph-based learning typically requires large amounts of training data, we extend work in few-shot prediction to show how the object representations learned using predicate functions can be further used to transfer to rare predicates. The few-shot literature is broadly divided into two main strategies. The first strategy learns a classifier for a set of frequent categories and then uses them to learn the few-shot categories (koch2015siamese, ; vinyals2016matching, ; triantafillou2017few, ; garcia2017few, ). The second strategy learns invariances or decompositions that enable few-shot classification (fe2003bayesian, ; fei2006one, ; lake2011one, ; snell2017prototypical, ; mehrotra2017generative, ; chen2019scene, ). Our framework more closely resembles the first strategy because we use the object representations learned using the frequent predicates to identify few-shot relationships with rare predicates.

Modular neural networks have been successful in numerous machine learning applications (andreas2016neural, ; kumar2016ask, ; xiong2016dynamic, ; andreas2016learning, ; johnson2017inferring, ). Typically, their utility has focused on the ability to train individual components and then jointly fine-tune them. Our paper focuses on a complementary ability of such networks: our functions are trained together and then used to learn additional predicates without retraining the entire model.

Figure 1: We introduce a scene graph approach that formulates predicates as learned functions, which result in an embedding space for objects that is effective for few-shot prediction. Our formulation treats predicates as learned semantic and spatial functions, which are trained within a graph convolution network. First, we extract bounding box proposals from an input image and represent objects as semantic features and spatial attentions. Next, we construct a fully connected graph where object representations form the nodes and the predicate functions act as edges. Here we show how one node, the person’s representation, is updated within one graph convolution step.

3 Graph convolution framework with predicate functions

In this section, we describe our graph convolution framework (Figure 1) and the predicate functions.

Problem formulation. Our goal is to learn effective predicate functions whose transformations result in effective object embeddings. We will use these functions for the task of scene graph generation in a graph convolution framework. Formally, the input to our model is an image, from which we extract a set of bounding box proposals using a region proposal network (ren2015faster, ). From these bounding boxes, we extract initial object features. These boxes and features are sent to our graph convolution framework. The final output of our model is a scene graph with nodes (objects) and labeled edges (relationships), where each edge label is one of the predicate categories.

Traditional graph convolutional network. Our model is primarily motivated as an extension to graph convolutional networks that operate on local graph neighborhoods (duvenaud2015convolutional, ; schlichtkrull2017modeling, ; kipf2016semi, ). These methods can be understood as simple message passing frameworks (gilmer2017neural, ). Each node holds a hidden representation at every iteration. An aggregation function accumulates information from the other nodes in the node's set of neighbors in the graph, and a vertex update function combines this aggregate with the node's current hidden representation to produce its representation for the next iteration.
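As a rough illustration of the generic message passing scheme described above, the sketch below runs one synchronous iteration over a small graph. The function and variable names (`message_passing_step`, `message_fn`, `update_fn`) are illustrative assumptions rather than the paper's notation, and summation is assumed as the aggregation.

```python
from typing import Callable, Dict, List

Vector = List[float]


def message_passing_step(
    hidden: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    message_fn: Callable[[Vector, Vector], Vector],
    update_fn: Callable[[Vector, Vector], Vector],
) -> Dict[int, Vector]:
    """Run one synchronous message passing iteration over all nodes."""
    new_hidden = {}
    for node, h_node in hidden.items():
        # Aggregate (sum, by assumption) the messages from every neighbor.
        aggregated = [0.0] * len(h_node)
        for other in neighbors.get(node, []):
            message = message_fn(h_node, hidden[other])
            aggregated = [a + m for a, m in zip(aggregated, message)]
        # Combine the node's current state with the aggregated message.
        new_hidden[node] = update_fn(h_node, aggregated)
    return new_hidden


# Toy usage: messages copy the neighbor state, updates average old and new.
hidden = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
updated = message_passing_step(
    hidden,
    neighbors,
    message_fn=lambda h_i, h_j: h_j,
    update_fn=lambda h_i, m: [(a + b) / 2 for a, b in zip(h_i, m)],
)
```

All nodes are updated from the previous iteration's states, so the result does not depend on the order in which nodes are visited.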

Our graph convolutional network. Similar to previous work (schlichtkrull2017modeling, ) which used multiple edge categories, we expand the above formulation to support multiple edge types, i.e. given any ordered pair of nodes, an edge exists from the first to the second for every predicate category. Unlike previous work where edges are an input schlichtkrull2017modeling , we initialize a fully connected graph, i.e. all objects are connected to all other objects by all predicate edges. If, after the graph messages are passed, a predicate is scored above a hyperparameter threshold, then the corresponding relationship is part of the generated scene graph. The updated equations are then,


where are learned message functions between two nodes for the predicate , which we will detail later in this section. Note that this formula is a generalized version of the exact representation used in the previous work (schlichtkrull2017modeling, ), where if and otherwise, and is the sigmoid activation. Here, is a normalizing constant for the edge as defined in previous work (schlichtkrull2017modeling, ).

Node hidden representations. With the overall update step for each node defined, we now explain the hidden object representation. Traditionally, object nodes in graph models are defined as a single vector representation of the node (duvenaud2015convolutional, ; schlichtkrull2017modeling, ; kipf2016semi, ). However, in our case, we want these hidden representations to encode both the semantic information for each object proposal as well as its spatial location in the image. These two components will be separately utilized by the semantic and spatial predicate functions. Instead of asking our model to learn to represent both of these pieces of information, we build invariances into our representation such that it encodes them both explicitly. Specifically, we define each hidden representation as a tuple of two entries: a semantic object feature and a spatial attention map over the image. In practice, we extract the semantic feature from the penultimate layer in ResNet-50 He2015 and set the spatial attention map as a mask that separates the pixels within the object proposal from those outside.

With the semantic and spatial separation, we can rewrite equation 2:


Note that the spatial attention map does not get updated because we fix the object masks for each object.
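The two-part hidden representation described in this section can be sketched as a small data structure. This is only an illustration: the names `NodeState` and `box_to_mask`, the toy feature values, and the choice of 1.0 inside the box and 0.0 outside are assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class NodeState(NamedTuple):
    semantic: List[float]        # object feature vector (e.g. from a CNN)
    spatial: List[List[float]]   # attention map over the image grid


def box_to_mask(
    box: Tuple[int, int, int, int], height: int, width: int
) -> List[List[float]]:
    """Rasterize box (x0, y0, x1, y1), exclusive upper bounds, as a mask.

    Assumption: pixels inside the box are 1.0 and pixels outside are 0.0.
    """
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


# Toy usage on a 3x4 image grid with a made-up 2-d semantic feature.
state = NodeState(semantic=[0.2, 0.7], spatial=box_to_mask((1, 0, 3, 2), 3, 4))
```

Keeping the mask as a separate field makes it easy to update only the semantic part during graph convolution while the spatial part stays fixed.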

Predicate functions. To define the message functions, we introduce semantic and spatial predicate functions for each predicate. Semantic functions are multi-layer perceptrons (MLP) while spatial functions are convolution layers, each using ReLU activations. Previous work on multi-graph convolutions (schlichtkrull2017modeling, ) assumed that they had a priori information about the structure of the graph, i.e. which edges exist between any two nodes. In our case, we are attempting to perform both node classification as well as edge prediction simultaneously. Without knowing which edges actually exist in the graph, we would be adding a lot of noise if we allowed every predicate to equally influence another node. To circumvent this issue, we first calculate a score for each predicate:


where is a hyperparameter, is the cosine distance function, and is the differentiable intersection over union function that measures the similarity between two soft heatmaps. This gives us a score for how likely the node believes that the edge exists. Similar to recent work (krishna2018referring, ), shifts the spatial attention from to where it thinks node should be. It encodes the spatial properties of the predicate we are learning and ignores the object features. To complement the spatial predicate function, we use to transform . This shifted representation is what the model expects to be similar to . By using both the spatial and semantic score in our update of , the two representations interact with one another. So, even though these components are separate, they create a cohesive score for each predicate. This score is used to weight how much node will influence node through a predicate in the update in equation 2. We can now define:
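To make the scoring idea concrete, here is a minimal sketch that combines a semantic agreement term with a spatial agreement term. The exact formula is not recoverable from this text, so the weighted sum controlled by `alpha`, the use of cosine similarity, and this particular soft IoU (`sum(a*b) / sum(a + b - a*b)`) are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def soft_iou(a: List[List[float]], b: List[List[float]]) -> float:
    """One common differentiable IoU for heatmaps with values in [0, 1]."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x + y - x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union > 0 else 0.0


def predicate_score(
    transformed_semantic: Sequence[float],
    target_semantic: Sequence[float],
    shifted_attention: List[List[float]],
    target_attention: List[List[float]],
    alpha: float = 0.5,
) -> float:
    # Assumed combination: alpha trades off semantic vs. spatial agreement.
    semantic = cosine_similarity(transformed_semantic, target_semantic)
    spatial = soft_iou(shifted_attention, target_attention)
    return alpha * semantic + (1.0 - alpha) * spatial


# Perfect agreement on both components gives a score close to 1.
mask = [[0.0, 1.0], [0.0, 1.0]]
high = predicate_score([1.0, 2.0], [1.0, 2.0], mask, mask)
# Orthogonal features and disjoint masks give a score of 0.
low = predicate_score([1.0, 0.0], [0.0, 1.0], mask, [[1.0, 0.0], [1.0, 0.0]])
```

The inputs stand for the outputs of a predicate's semantic and spatial functions applied to one node, compared against the other node's representation.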


Each predicate also has an inverse function, which represents the backward predicate function from the object back to the subject. For example, given the relationship person - riding - snowboard, our model not only learns how to transform person using the function riding, but also how to transform snowboard to person by using the inverse of riding. Learning both the forward and backward functions per predicate allows us to pass messages in both directions even though our predicates are directed edges.

Hidden representation update. We now define that accumulate the messages passed by the semantic predicate functions to update the semantic object representation:


where is learned weight. The spatial representation does not get updated because the spatial location of an object does not move.
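The role of the scores in the update can be illustrated with a small sketch in which each incoming semantic message is weighted by its predicate score before being added to the node's current feature. The residual form and the division by the number of messages are assumptions for this example; the learned weight and activation of the actual update are omitted.

```python
from typing import List, Sequence, Tuple


def weighted_semantic_update(
    h_node: Sequence[float],
    scored_messages: Sequence[Tuple[float, Sequence[float]]],
) -> List[float]:
    """Add the score-weighted mean of incoming messages to the node feature.

    scored_messages holds (score, message) pairs, one per (neighbor, predicate).
    """
    if not scored_messages:
        return list(h_node)
    total = [0.0] * len(h_node)
    for score, message in scored_messages:
        total = [t + score * m for t, m in zip(total, message)]
    count = len(scored_messages)
    return [h + t / count for h, t in zip(h_node, total)]


# A message with score 0 has no influence, however large its values are.
new_h = weighted_semantic_update(
    [1.0, 1.0],
    [(1.0, [2.0, 0.0]), (0.0, [100.0, 100.0])],
)
```

This shows why scoring matters on a fully connected graph: predicates that the node pair does not support contribute nothing to the update.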

Scene graph output. Finally, we predict the category of each node using an MLP that generates a probability distribution over all the possible object categories. Each possible relationship is output only if its score after the final iteration of the model exceeds a threshold hyperparameter.
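The thresholding step can be sketched as follows. The data layout (a dictionary keyed by subject index, predicate name, and object index) and the function name `extract_scene_graph` are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def extract_scene_graph(
    labels: List[str],
    scores: Dict[Tuple[int, str, int], float],
    threshold: float,
) -> List[Tuple[str, str, str]]:
    """Keep every (subject, predicate, object) whose score exceeds threshold."""
    triples = []
    # Visit candidates from highest to lowest score for a stable output order.
    for (subj, predicate, obj), score in sorted(
        scores.items(), key=lambda item: item[1], reverse=True
    ):
        if subj != obj and score > threshold:
            triples.append((labels[subj], predicate, labels[obj]))
    return triples


labels = ["person", "snowboard"]
scores = {
    (0, "riding", 1): 0.9,
    (0, "wearing", 1): 0.2,
    (1, "under", 0): 0.6,
}
graph = extract_scene_graph(labels, scores, threshold=0.5)
```

Because each predicate is thresholded independently, a pair of objects can end up with several relationships or with none at all.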

Figure 2: Overview of our few-shot training framework. We use the learned predicate function from the graph convolution framework to generate embeddings and attention masks for the object representations. These representations are used to train few-shot predicate classifiers.

4 Few-shot predicate framework

With our semantic () and spatial () predicate functions trained for the frequent predicates , we now utilize these functions to create object representations to train few-shot predicates. We design few-shot predicate classifiers to be MLPs with layers with ReLU activations between layers. We assume that rare predicates are and only have examples each.

The intuition behind our few-shot training scheme lies in the modularity of predicates and their shared semantic and spatial components. By decomposing the predicate representations from the object in the graph convolutions, we create a representation space that supports predicate transformations. We will show in our experiments that our embedding space places semantically similar objects that participate in similar relationships together. Now, when training with few examples of rare predicates, such as driving, we can rely on the semantic embeddings for objects that were clustered by riding.

We pass all labelled examples of a predicate's pair of objects through the learned predicate functions and extract the hidden representations from the final graph convolution layer. We concatenate these transformations along the channel dimension and feed them as an input to the few-shot classifiers. We train the few-shot classifiers by minimizing the cross-entropy loss against the labelled examples amongst rare categories.
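The training recipe above (concatenate the two object features, then minimize cross-entropy over the rare predicates) can be sketched with a tiny classifier. To keep the example short and dependency-free, it swaps the MLP for a linear softmax classifier trained with plain stochastic gradient descent; the toy features, learning rate, and epoch count are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, biases, x) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train_softmax_classifier(features, labels, num_classes, epochs=200, lr=0.5):
    """Minimize cross-entropy with SGD (linear model, not the paper's MLP)."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, biases, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == y].
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                weights[c] = [w - lr * grad * xi for w, xi in zip(weights[c], x)]
    return weights, biases


def predict(weights, biases, x) -> int:
    scores = logits_for(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Two rare predicates with two labeled (subject, object) feature pairs each.
pairs: List[Tuple[List[float], List[float]]] = [
    ([1.0, 0.0], [0.0, 1.0]), ([0.9, 0.1], [0.1, 0.9]),
    ([0.0, 1.0], [1.0, 0.0]), ([0.1, 0.9], [0.9, 0.1]),
]
labels = [0, 0, 1, 1]
features = [subj + obj for subj, obj in pairs]  # concatenate the two features
weights, biases = train_softmax_classifier(features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in features]
```

Only the small classifier is trained here; the features are treated as fixed inputs, which mirrors the idea of reusing representations from already trained predicate functions.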

Task                                            SGGen       SGGen       SGCls       SGCls     PredCls     PredCls
Metric                                      recall@50  recall@100   recall@50  recall@100   recall@50  recall@100

vision only
IMP (xu2017scene, )                              6.40        8.00       20.60       22.40       40.80       45.20
MSDN (li2017scene, )                             7.00        9.10       27.60       29.90       53.20       57.90
MotifNet-freq (zellers2017neural, )              6.90        9.10       23.80       27.20       41.80       48.80
Graph R-CNN (yang2018graph, )                   11.40       13.70       29.60       31.60       54.20       59.10
Our full model                                  13.18       13.45       23.71       24.66       56.65       57.21

linguistic priors or external knowledge
Factorizable Net (li2018factorizable, )         13.06       16.47           -           -           -           -
KB-GAN (gu2019scene, )                          13.65       17.57           -           -           -           -
MotifNet (zellers2017neural, )                  27.20       30.30       35.80       36.50       65.20       67.10
PI-SG (herzig2018mapping, )                         -           -       36.50       38.80       65.10       66.90

ablations
Our spatial only                                 2.05        2.32        3.92        4.54        4.19        4.50
Our semantic only                               12.92       12.39       23.35       24.00       56.02       56.67
Our full model                                  13.18       13.45       23.71       24.66       56.65       57.21

Table 1: We perform on par with all existing state-of-the-art scene graph approaches and even outperform other methods that only utilize Visual Genome’s data as supervision. We also report ablations by separating the contribution of the semantic and the spatial components.

Figure 3: Example scene graphs predicted by our graph convolution fully-trained model.

5 Experiments

We begin our evaluation by first describing the dataset, evaluation metrics, and baselines. Our first experiment studies our graph convolution framework and compares our scene graph prediction performance against existing state-of-the-art methods. Our second experiment tests the utility of our approach on our main objective of enabling few-shot scene graph prediction. Finally, our third experiment showcases interpretable visualizations by visualizing the predicate transformations.

Dataset: We use the Visual Genome (krishna2017visual, ) dataset for training, validation and testing. To benchmark against existing scene graph approaches, we use the commonly used subset of object and predicate categories (xu2017scene, ; zellers2017neural, ; yang2018graph, ). We use publicly available pre-processed splits of train and test data, and sample a validation set from the training set (zellers2017neural, ). The training, validation, and test sets contain and and images, respectively.

Evaluation metrics: For scene graph prediction, we use three evaluation tasks, all of which are evaluated at recall@50 and recall@100. (1) PredCls predicts predicate categories, given ground truth bounding boxes and object classes, (2) SGCls predicts predicate and object categories given ground truth bounding boxes, and (3) SGGen detects object locations, categories and predicate categories.

Metrics based on recall require ranking predictions. For PredCls this means a simple ranking of predicted predicates by score. For SGCls this means ranking subject-predicate-object tuples by a product of subject, object, and predicate scores. For SGGen this means a similar product as SGCls, but tuples without correct subject or object localizations are not counted as correct. We refer readers to previous work that defined these metrics for further reading (lu2016visual, ).
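The ranking-based metric can be sketched for a single image as below. The helper names are illustrative assumptions, and localization checks for SGGen are left out; in practice the per-image values are aggregated over the test set.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def triple_score(subject_score: float, predicate_score: float,
                 object_score: float) -> float:
    """Rank a tuple by the product of its three component scores."""
    return subject_score * predicate_score * object_score


def recall_at_k(predictions: List[Tuple[Triple, float]],
                ground_truth: Set[Triple], k: int) -> float:
    """Fraction of ground-truth triples found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    top_k = {triple for triple, _ in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


a = ("woman", "drinking", "coffee")
b = ("woman", "holding", "table")
c = ("coffee", "on", "table")
predictions = [
    (a, triple_score(1.0, 0.9, 1.0)),
    (b, triple_score(1.0, 0.8, 1.0)),
    (c, triple_score(0.5, 0.4, 0.5)),
]
ground_truth = {a, c}
r1 = recall_at_k(predictions, ground_truth, k=1)
r3 = recall_at_k(predictions, ground_truth, k=3)
```

A larger k can only keep or raise recall for a fixed ranking, which is why a model that outputs few relationships gains less when moving to a larger k.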

For few-shot prediction, we report recall@ and recall@ on the task of PredCls. We vary the number of labeled examples available for training few-shot predicate classifiers from . We also report recall@ in addition to the traditional recall@ because each image only has a few instances of rare predicates in the test set.

Figure 4: We show Recall@ and Recall@ results on -shot predicates. We outperform strong baselines like transfer learning on MotifNet zellers2017neural , which also relies on linguistic priors.

Baselines: We classify existing methods into two categories. The first category includes other scene graph approaches that, like our approach, only utilize Visual Genome’s data as supervision. This includes Iterative Message Passing (IMP) (xu2017scene, ), Multi-level Scene Description Network (MSDN) (li2017scene, ), ViP-CNN (li2017vip, ), and MotifNet-freq zellers2017neural . The second category includes models such as Factorizable Net (li2018factorizable, ), KB-GAN (gu2019scene, ) and MotifNet zellers2017neural , which use linguistic priors in the form of word vectors or external information from knowledge bases, while MotifNet also deploys a custom trained object detector, class-conditioned non-maximum suppression, and heuristically removes all object pairs that do not overlap. While not comparable, we report their numbers for clarity.

5.1 Scene graph prediction

We report scene graph prediction numbers on Visual Genome krishna2017visual in Table 1. This experiment is meant to serve as a benchmark against existing scene graph approaches. We outperform existing models that only use Visual Genome supervision for SGGen and PredCls by 1.78 and 2.45 recall@50, respectively, but we fall short on recall@100. As we move from recall@50 to recall@100, models are evaluated on their top 100 predictions instead of their top 50. Unlike other models that perform a multi-class classification of predicates for every object pair, we assign binary scores to each possible predicate between an object pair individually. Therefore, we can report that no relationship exists between a pair of objects. While this design decision allows us to separate learning predicate transformations and object representations, it penalizes our model for not guessing relationships for every single object pair, thereby reducing our recall@100 scores. We also notice that since our model doesn’t utilize the object categories to make relationship predictions, it performs worse for the task of SGCls, which presents models with ground truth object locations.

We also report ablations of our model trained using only the semantic or spatial functions. We observe that different ablations of the model perform better on certain types of predicates. The spatial model performs well on predicates that have a clear spatial or location-based aspect, such as above and under. The semantic model performs better on non-spatial predicates such as has and holding. Our full model outperforms the individual semantic-only and spatial-only models as predicates can utilize both components. We visualize some scene graphs generated by our network in Figure 6.

5.2 Few-shot prediction

Our second experiment studies how well we perform few-shot scene graph prediction with limited examples per predicate. Our approach requires two sets of predicates, a set of frequently occurring predicates and a second set of rare predicates with only a few examples each. We split the usual predicates typically used in Visual Genome, and place the predicates with the most training examples into the first set and the remaining predicates into the second set. In our experiments, we train the predicate functions and the graph convolution framework using the predicates in the first set. Next, we use them to train few-shot classifiers for the rare predicates in the second set by utilizing the representations generated by the pretrained predicate functions. We iterate over .
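The split into frequent and rare predicates, and the selection of a fixed number of labeled examples per rare predicate, can be sketched as follows. The function names, the toy annotations, and the choice to keep the first k examples in input order are assumptions for this example; the split size is left as a parameter.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def split_predicates(predicate_labels: Sequence[str],
                     num_frequent: int) -> Tuple[List[str], List[str]]:
    """Rank predicates by training count and split into frequent and rare."""
    counts = Counter(predicate_labels)
    ranked = [name for name, _ in counts.most_common()]
    return ranked[:num_frequent], ranked[num_frequent:]


def k_shot_subset(examples: Sequence[Tuple[Hashable, str]],
                  rare: Sequence[str], k: int) -> List[Tuple[Hashable, str]]:
    """Keep at most k labeled examples per rare predicate, in input order."""
    kept, seen = [], Counter()
    for example, predicate in examples:
        if predicate in rare and seen[predicate] < k:
            kept.append((example, predicate))
            seen[predicate] += 1
    return kept


annotations = ["on", "on", "on", "has", "has", "riding", "driving"]
frequent, rare = split_predicates(annotations, num_frequent=2)
examples = [("img1", "riding"), ("img2", "on"), ("img3", "riding"),
            ("img4", "driving")]
one_shot = k_shot_subset(examples, rare, k=1)
```

Examples of frequent predicates are dropped from the few-shot subset, since those predicates are handled by the pretrained predicate functions instead.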

For a rigorous comparison, we choose to compare our method against MotifNet zellers2017neural , which outperforms all existing scene graph approaches and uses linguistic priors from word embeddings and heuristic post-processing to generate high-quality scene graphs. Specifically, we report two different training variants of MotifNet: MotifNet-Baseline, which is initialized with random weights and trained only using labelled examples and MotifNet-Transfer, which is first trained on the frequent predicates and then finetuned on the few-shot predicates. We also compare against Ours-Baseline, which trains our graph convolution framework on the few-shot predicates and Ours-Oracle, which reports the upper bound performance when trained with all of Visual Genome.

Results in Figure 4 show that our method performs better than all baseline comparisons for all values of . We find that our learned classifiers are similar in performance to MotifNet-Transfer when . This is likely because MotifNet-Transfer also has access to additional information available from word embeddings. The improvements seen by our approach increase as the number of labeled examples increases to 5, where we outperform the baselines by 4.16 recall@1. Eventually, as more labels become available, the Neural Motif model outperforms our model for values of .

5.3 Interpretable predicate transformation visualizations

Our final experiment showcases another utility of treating predicates as functions. Once trained, these functions can be individually visualized and qualitatively evaluated. Figure 5 (left and middle) shows examples of transforming spatial attention from four instances of person, horse, boy, and banana in four images. We see that above and standing on move attention below the person, looking moves attention left towards the direction the horse is looking, and wearing highlights the center of the boy. Figure 5 (right) shows semantic transformations applied to the embedding representation space of objects. We see that riding transforms the embedding to a space that contains objects like wave, skateboard, bike and horse. Notice that unlike linguistic word embeddings, which are trained to place words found in similar contexts together, our embedding space represents the types of visual relationships that objects participate in. We include more visualizations in our appendix.
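The semantic visualization amounts to a nearest-neighbor lookup: transform a subject embedding with a predicate function, then list the object categories closest to the result. The sketch below assumes cosine similarity as the closeness measure and uses made-up 2-d embeddings; the query vector stands in for the output of a trained predicate function.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_categories(query: Sequence[float],
                       category_embeddings: Dict[str, Sequence[float]],
                       top_n: int = 3) -> List[str]:
    """Return the top_n category names most similar to the query embedding."""
    ranked = sorted(
        category_embeddings,
        key=lambda name: cosine_similarity(query, category_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical average category embeddings and a hypothetical transformed
# subject (e.g. the result of applying a "riding" function to "girl").
category_embeddings = {
    "bike": [1.0, 0.0],
    "horse": [0.9, 0.1],
    "hat": [0.0, 1.0],
}
transformed_subject = [1.0, 0.05]
closest = nearest_categories(transformed_subject, category_embeddings, top_n=2)
```

With real embeddings, the returned names would play the role of the "closest objects" listed for each subject and predicate.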

Figure 5: (left, middle) Spatial transformations learned by our model applied to object masks in images. (right) Semantic transformations applied to the average object category embedding; we show the nearest neighboring object categories to the transformed subject.

6 Conclusion

We introduced the first scene graph prediction model that treats predicates as functions and generates object representations that can effectively enable few-shot learning. We treat predicates as neural network transformations between object representations. The functions disentangle the object representations from storing predicate information, and instead generate an embedding space in which objects that participate in similar relationships are embedded close together. Our representations outperform existing methods for few-shot predicate prediction, a valuable task since most predicates occur infrequently. Also, our graph convolution network, which trains the predicate functions, performs on par with existing state-of-the-art scene graph prediction models. Finally, the predicate functions result in interpretable visualizations, allowing us to visualize the spatial and semantic transformations learned for each predicate.

Acknowledgements We thank Iro Armeni, Suraj Nair, Vincent Chen, and Eric Li for their helpful comments. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


7 Appendix

We include additional scene graph outputs by our graph convolution model, include visualizations for the spatial and the semantic transformations and finally plot a visualization of the object feature space.

7.1 More scene graph model outputs

Figure 6 shows more examples of scene graphs generated by our model. The scene graph for the image in the middle of a woman riding a motorcycle shows that our model is able to identify the main action taking place in the image. It is also able to correctly identify parts of the motorcycle, such as seat, tire, and light. The scene graph and image in the bottom right shows that our model can identify parts of the woman’s body, like nose and leg. It is also able to predict the woman’s actions: carrying the bag and holding the umbrella.

Figure 6: Example scene graphs generated by our graph convolution fully-trained model.

7.2 Semantic transformations

Table 2 shows more examples of semantic transformations applied to the embedding feature space of objects. child transformed by walking on resembles objects that we walk on: street, sidewalk, and snow. We also learn more specific and rare relationships such as attached to. We observe that sign transformed by attached to most closely resembles objects such as pole and fence.

subject     predicate      closest objects
girl        riding         wave, skateboard, bike, horse
man         wears          shirt, jacket, hat, cap, helmet
person      has            hair, head, face, arm, ear
dog         laying on      bed, beach, bench, desk, table
child       walking on     street, sidewalk, snow, beach
boy         sitting on     bench, bed, desk, chair, toilet
umbrella    covering       kid, people, skier, person, guy
tail        belonging to   cat, elephant, giraffe, dog
stand       over           street, sidewalk, beach, hill
mountain    and            hill, mountain, skier, snow
motorcycle  parked on      street, sidewalk, snow, beach
sign        attached to    pole, fence, shelf, post, building
sidewalk    in front of    building, room, house, fence
kid         watching       giraffe, zebra, plane, horse
men         looking at     airplane, plane, bus, laptop
child       standing on    sidewalk, beach, snow, track
guy         holding        racket, umbrella, glass, bag
motorcycle  has            heel, wing, handle, tire, engine
Table 2: We visualize a predicate’s semantic transformations by showing the closest objects to a given transformed subject.

7.3 Inverse predicate functions

To understand the effect of including inverse predicate functions, we performed an ablation study where the inverse predicate functions were omitted from the model. We found that the semantic-only model trained without inverse functions performed worse on recall@ than the semantic model with inverse functions.

We also visualize how these inverse functions transform a particular subject when compared to the output of the forward function as shown in Figure 7. We observe that the spatial function for the predicate riding shifts attention below the person in the image. Qualitatively, this is the expected result because the skateboard is below the person. The inverse transformation of riding shifts the skateboard mask slightly above the skateboard. Similarly, this is also the expected result because skateboarders are typically above their boards.

Figure 7: We visualize inverse predicate function transformations.

7.4 Visualize Object Representations

In the process of training our predicate functions, we learn representations for each object instance we encounter. From the embedding of each object instance, we calculate the average object category embedding. Each of the distinct object categories is embedded into a learned -dimension space. Figure 8 shows a t-SNE visualization of these embeddings. We observe object categories that participate in similar relationships grouped together. For example, embeddings for bird, cow, bear, and other animals are close together (inside the red rectangle).
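Computing an average embedding per object category from instance embeddings is straightforward; a minimal sketch with made-up 2-d instance embeddings follows. The function name and data layout are assumptions for this example.

```python
from typing import Dict, List, Sequence, Tuple


def average_category_embeddings(
    instances: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Average the embeddings of all instances that share a category."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for category, embedding in instances:
        if category not in sums:
            sums[category] = [0.0] * len(embedding)
            counts[category] = 0
        sums[category] = [s + e for s, e in zip(sums[category], embedding)]
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


instances = [
    ("dog", [1.0, 3.0]),
    ("dog", [3.0, 5.0]),
    ("cat", [2.0, 2.0]),
]
category_means = average_category_embeddings(instances)
```

The resulting per-category vectors are what a dimensionality reduction method such as t-SNE would take as input for plotting.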

Figure 8: We show a -dimensional t-SNE visualization of the object category embeddings learned by our model.