Scene graph prediction takes as input an image of a visual scene and returns as output a set of relationships denoted as subject - predicate - object, such as woman - drinking - coffee and coffee - on - table. The goal is for these models to classify a large number of relationships for each image. However, due to the complexity of the task and the uneven distribution of relationship instances both in the world and in training data, existing scene graph models perform well only on the most frequent relationships (predicates). These existing models can be broadly divided into two approaches. The first approach detects the objects and then recognizes their pairwise relationships (dai2017detecting, ; liao2017natural, ; lu2016visual, ; yu2017visual, ). The second approach jointly infers the objects and their relationships li2017vip ; li2017scene ; xu2017scene based on object proposals. Both approaches treat relationship prediction as a multiclass predicate classification problem, given two object features. Such a formulation produces reasonable results because objects are a good indicator of relationships zellers2017neural . However, since the resulting object representations are used for both object and predicate classification, they confound the information required for the two tasks. The representations are therefore not generalizable and cannot be used to train the vast majority of less frequently occurring predicates.
We present a new scene graph model that formulates predicates as functions, resulting in a scene graph model whose object representations can be used for few-shot predicate prediction. Instead of using the object representations to predict predicates, we treat each predicate as two individual functions: a forward function that transforms the subject representation into the object, and an inverse function that transforms the object representation back into the subject. We further introduce a new graph convolution framework that uses these functions as localized message passing protocols between object nodes kipf2016semi . To further ensure that the object representations are disentangled from encoding specific information about a predicate, we divide each forward and inverse function into two components: a spatial component that transforms attention over the image space krishna2018referring and a semantic component that operates over the object features zhang2017visual . Within each graph convolution step, each pair of object representations scores the functions by checking which of them agree with the difference between the two representations. These scores are then used to weight the transformations performed by the functions and to update the object representations. After multiple iterations, the object representations are classified into object categories, and each function weight that remains above a threshold results in a detected relationship.
By treating predicates as functions between object representations, our model is able to learn a meaningful embedding space that can be used for transfer learning of new few-shot predicate categories. For example, the forward function for riding learns to move the spatial attention to look below the subject to find the object and to move to a semantic location where rideable objects like car, skateboard, and bike can be found. We use the object representations generated by these functions to train few-shot predicate classifiers such as driving with as few as labeled example.
Through our experiments on Visual Genome krishna2017visual , a dataset containing visual relationship data, we show that the object representations generated by the predicate functions result in meaningful features that can be used to enable few-shot scene graph prediction, exceeding existing transfer learning approaches by at recall@ with labelled examples. We further justify our design decisions by demonstrating that our scene graph model performs on par with existing state-of-the-art models and even outperforms models that also do not utilize external knowledge bases gu2019scene , linguistic priors lu2016visual ; zellers2017neural , or rely on complicated pre- and post-processing heuristics zellers2017neural ; chen2019scene . We run ablations where we remove the semantic or spatial components of our functions and demonstrate that both components lead to increased performance but the semantic component is responsible for most of the performance. Finally, since our predicates are transformation functions, we can visualize them individually, enabling the first interpretable scene graph model.
2 Related work
Scene graphs were introduced as a formal representation for visual information (johnson2015image, ; krishna2017visual, ) in a form widely used in knowledge bases (guodong2005exploring, ; culotta2004dependency, ; zhou2007tree, ). Each scene graph encodes objects as nodes connected together by pairwise relationships as edges. Scene graphs have led to many state-of-the-art models in image captioning (anderson2016spice, ), image retrieval (johnson2015image, ; schuster2015generating, ), visual question answering (johnson2017inferring, ), relationship modeling (krishna2018referring, ), and image generation (johnson2018image, ). Given its versatile utility, the task of scene graph prediction has resulted in a series of publications (krishna2017visual, ; dai2017detecting, ; liang2017deep, ; li2017vip, ; li2017scene, ; newell2017pixels, ; xu2017scene, ; zellers2017neural, ; yang2018graph, ; herzig2018mapping, ) that have explored reinforcement learning (liang2017deep, ), structured prediction (krahenbuhl2011efficient, ; desai2011discriminative, ; tu2010auto, ), utilizing object attributes (farhadi2009describing, ; parikh2011relative, ), sequential prediction (newell2017pixels, ), and graph-based (xu2017scene, ; li2018factorizable, ; yang2018graph, ) approaches. However, all of these approaches have classified predicates using object features, confounding the object features with predicate information, which limits their utility when they are used to train new few-shot predicate categories.
but we generalize existing methods by treating predicates as functions, implemented as general neural network modules. Recent work on referring relationships showed that predicates can be learned as spatial transformations in visual attention (krishna2018referring, ). We extend this idea to formulate predicates as message passing semantic and spatial functions in a graph convolution framework. This framework generalizes existing work (li2018factorizable, ; yang2018graph, ) where relationships are usually treated as latent representations instead of functions. It also generalizes papers that have restricted these functions to linear transformations (bordes2013translating, ; zhang2017visual, ).
Graph convolutions. Modeling graphical data has historically been challenging, especially when dealing with large amounts of data (weston2012deep, ; belkin2006manifold, ; zhou2004learning, ). Traditional methods have relied on Laplacian regularization through label propagation (zhou2004learning, ), manifold regularization (belkin2006manifold, ), or learning embeddings (weston2012deep, ). Recently, operators on local neighborhoods of nodes have become popular with their ability to scale to larger amounts of data and parallelizable computation (grover2016node2vec, ; perozzi2014deepwalk, ). Inspired by these Laplacian-based, local operations, graph convolutions (kipf2016semi, ) have become the de facto choice when dealing with graphical data (kipf2016semi, ; scarselli2009graph, ; li2015gated, ; henaff2015deep, ; duvenaud2015convolutional, ; niepert2016learning, ). Graph convolutions have recently been combined with RCNN (girshick2015fast, ) to perform scene graph detection (yang2018graph, ; johnson2018image, ). Unlike most graph convolution methods, which assume a known graph structure, our framework doesn’t make any prior assumptions to limit the types of relationships between any two object nodes, i.e. we don’t use relationship proposals to limit the possible edges. Instead, we learn to score the predicate functions between the nodes, strengthening the correct relationships and weakening the incorrect ones over multiple iterations.
Few-shot prediction. While graph-based learning typically requires large amounts of training data, we extend work in few-shot prediction to show how the object representations learned using predicate functions can be transferred to rare predicates. The few-shot literature is broadly divided into two main strategies. The first strategy learns a classifier for a set of frequent categories and then uses it to learn the few-shot categories (koch2015siamese, ; vinyals2016matching, ; triantafillou2017few, ; garcia2017few, ). The second strategy learns invariances or decompositions that enable few-shot classification (fe2003bayesian, ; fei2006one, ; lake2011one, ; snell2017prototypical, ; mehrotra2017generative, ; chen2019scene, ). Our framework more closely resembles the first strategy because we use the object representations learned using the frequent predicates to identify few-shot relationships with rare predicates.
Modular neural networks have been successful in numerous machine learning applications (andreas2016neural, ; kumar2016ask, ; xiong2016dynamic, ; andreas2016learning, ; johnson2017inferring, ). Typically, their utility has focused on the ability to train individual components and then jointly fine-tune them. Our paper focuses on a complementary ability of such networks: our functions are trained together and then used to learn additional predicates without retraining the entire model.
3 Graph convolution framework with predicate functions
In this section, we describe our graph convolution framework (Figure 1) and the predicate functions.
Problem formulation. Our goal is to learn effective predicate functions whose transformations result in effective object embeddings. We will use these functions for the task of scene graph generation in a graph convolution framework. Formally, the input to our model is an image from which we extract a set of bounding box proposals using a region proposal network (ren2015faster, ). From these bounding boxes, we extract initial object features . These boxes and features are sent to our graph convolution framework. The final output of our model is a scene graph denoted as with nodes (objects) , and labeled edges (relationships) , where is one of predicate categories.
Traditional graph convolutional network. Our model is primarily motivated as an extension to graph convolutional networks that operate on local graph neighborhoods (duvenaud2015convolutional, ; schlichtkrull2017modeling, ; kipf2016semi, ). These methods can be understood as simple message passing frameworks (gilmer2017neural, ):
is a hidden representation of node in the iteration, and are respectively the aggregation and vertex update functions that accumulate information from the other nodes. is the set of neighbors of in the graph.
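The generic message-passing step described above can be sketched as follows. This is a minimal illustration of the framework of (gilmer2017neural, ), not the authors' code: the names `message_passing_step`, `message_fn`, and `update_fn` are hypothetical, and summation is assumed as the aggregation over neighbors.

```python
import numpy as np

def message_passing_step(h, neighbors, message_fn, update_fn):
    """One synchronous message-passing iteration over all nodes.

    h          : (n, d) array; row i is the hidden state of node i.
    neighbors  : dict mapping a node index to the list of its neighbor indices.
    message_fn : M(h_i, h_j) -> (d,) message sent from neighbor j to node i.
    update_fn  : U(h_i, m_i) -> (d,) new hidden state of node i.
    """
    n, d = h.shape
    new_h = np.empty_like(h)
    for i in range(n):
        # Aggregate (sum, by assumption) the messages from every neighbor of i.
        m_i = np.zeros(d)
        for j in neighbors.get(i, []):
            m_i += message_fn(h[i], h[j])
        # All updates read the old states, so the step is synchronous.
        new_h[i] = update_fn(h[i], m_i)
    return new_h

# Toy usage: a message copies the neighbor state; the update averages
# the old state with the aggregated message.
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
nbrs = {0: [1, 2], 1: [0], 2: [0]}
h1 = message_passing_step(
    h0, nbrs,
    message_fn=lambda h_i, h_j: h_j,
    update_fn=lambda h_i, m_i: 0.5 * (h_i + m_i),
)
```

Repeating this step lets information travel along longer paths in the graph, one hop per iteration.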
Our graph convolutional network. Similar to previous work (schlichtkrull2017modeling, ) which used multiple edge categories, we expand the above formulation to support multiple edge types, i.e. given two nodes and , an edge exists from to for all predicate categories. Unlike previous work where edges are an input schlichtkrull2017modeling , we initialize a fully connected graph, i.e. all objects are connected to all other objects by all predicate edges. If, after the graph messages are passed, predicate is scored above a hyperparameter threshold, then that relationship is part of the generated scene graph. The updated equations are then,
where are learned message functions between two nodes for the predicate , which we will detail later in this section. Note that this formula is a generalized version of the exact representation used in the previous work (schlichtkrull2017modeling, ), where if and otherwise, and is the sigmoid activation. Here, is a normalizing constant for the edge as defined in previous work (schlichtkrull2017modeling, ).
Node hidden representations. With the overall update step for each node defined, we now explain the hidden object representation . Traditionally, object nodes in graph models are defined as being a -dimensional representation of the node (duvenaud2015convolutional, ; schlichtkrull2017modeling, ; kipf2016semi, ). However, in our case, we want these hidden representations to encode both the semantic information for each object proposal as well as its spatial location in the image. These two components will be separately utilized by the semantic and spatial predicate functions. Instead of asking our model to learn to represent both of these pieces of information, we built invariances into our representation such that it knows to encode them both explicitly. Specifically, we define each hidden representation as a tuple of two entries: — a semantic object feature and a spatial attention map over the image . In practice, we extract from the penultimate layer in ResNet-50 He2015 and set as a mask with for the pixels within the object proposal and outside.
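As an illustration of this two-part hidden representation, the sketch below pairs a semantic feature vector with a binary spatial mask built from a bounding box. The names `NodeState`, `box_to_mask`, and `make_node` are hypothetical, the mask values 1 inside and 0 outside the box are an assumption, and the feature vector stands in for whatever the backbone network produces.

```python
import numpy as np
from typing import NamedTuple

class NodeState(NamedTuple):
    semantic: np.ndarray  # (d,) feature vector for the object proposal
    spatial: np.ndarray   # (H, W) attention mask over the image

def box_to_mask(box, height, width):
    """Binary mask that is 1 inside the box and 0 outside (assumed values).

    box is (x0, y0, x1, y1) in pixels, with exclusive end coordinates.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def make_node(feature, box, height, width):
    """Bundle a semantic feature with the spatial mask of its proposal."""
    return NodeState(
        semantic=np.asarray(feature, dtype=np.float32),
        spatial=box_to_mask(box, height, width),
    )

# Toy usage: a 4-dimensional placeholder feature and a 3x3 box in an 8x8 image.
node = make_node(feature=np.ones(4), box=(1, 2, 4, 5), height=8, width=8)
```

Keeping the two entries separate makes it straightforward to route each one to its own predicate function.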
With the semantic and spatial separation, we can rewrite equation 2:
Note that does not get updated because we fix the object masks for each object.
Predicate functions. To define , we introduce the semantic () and spatial () predicate functions for predicate . Semantic functions are multi-layer perceptrons (MLP) while spatial functions are convolution layers, each with layers and ReLU activations. Previous work on multi-graph convolutions (schlichtkrull2017modeling, ) assumed that they had a priori information about the structure of the graph, i.e. which edges exist between any two nodes. In our case, we are attempting to perform both node classification as well as edge prediction simultaneously. Without knowing which edges actually exist in the graph, we would be adding a lot of noise if we allowed every predicate to equally influence another node. To circumvent this issue, we first calculate a score for each predicate :
where is a hyperparameter, is the cosine distance function, and is the differentiable intersection over union function that measures the similarity between two soft heatmaps. This gives us a score for how likely the node believes that the edge exists. Similar to recent work (krishna2018referring, ), shifts the spatial attention from to where it thinks node should be. It encodes the spatial properties of the predicate we are learning and ignores the object features. To complement the spatial predicate function, we use to transform . This shifted representation is what the model expects to be similar to . By using both the spatial and semantic score in our update of , the two representations interact with one another. So, even though these components are separate, they create a cohesive score for each predicate. This score is used to weight how much node will influence node through a predicate in the update in equation 2. We can now define:
represents the backward predicate function from object back to the subject. For example, given the relationship person - riding - snowboard, our model not only learns how to transform person using the function riding, but also how to transform snowboard to person by using the inverse predicate . Learning both the forward and backward functions per predicate allows us to pass messages in both directions even though our predicates are directed edges.
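The scoring idea can be sketched as below, under stated assumptions. The exact way the two terms are combined is not recoverable from the text, so this sketch assumes a convex combination weighted by a hypothetical hyperparameter `alpha`; it uses cosine similarity so that larger values mean closer agreement, and a product-based soft IoU as one common differentiable variant. All function names are illustrative.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def soft_iou(a, b, eps=1e-8):
    """IoU for heatmaps with values in [0, 1]; products keep it differentiable."""
    intersection = (a * b).sum()
    union = (a + b - a * b).sum()
    return float(intersection / (union + eps))

def predicate_score(subject, obj, f_sem, f_spa, alpha=0.5):
    """Score how well a predicate's functions map the subject onto the object.

    subject, obj : (semantic vector, spatial mask) pairs.
    f_sem, f_spa : the predicate's semantic and spatial functions.
    alpha        : assumed mixing weight between the two agreement terms.
    """
    sem_s, spa_s = subject
    sem_o, spa_o = obj
    semantic_agreement = cosine_similarity(f_sem(sem_s), sem_o)
    spatial_agreement = soft_iou(f_spa(spa_s), spa_o)
    return alpha * semantic_agreement + (1.0 - alpha) * spatial_agreement

def shift_down(mask, rows=2):
    """Toy spatial function: move attention down by a fixed number of rows."""
    out = np.zeros_like(mask)
    out[rows:] = mask[:-rows]
    return out

# Toy usage: the object sits directly below the subject.
subject_mask = np.zeros((6, 6))
subject_mask[0:2, 1:4] = 1.0
object_mask = np.zeros((6, 6))
object_mask[2:4, 1:4] = 1.0
subject = (np.array([1.0, 0.0]), subject_mask)
obj = (np.array([1.0, 0.0]), object_mask)

identity = lambda x: x
score_match = predicate_score(subject, obj, identity, shift_down)
score_mismatch = predicate_score(subject, obj, identity, identity)
```

In this toy case the "shift down" function lands exactly on the object, so its score is close to 1, while a spatial function that leaves attention in place earns only the semantic half of the score.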
Hidden representation update. We now define that accumulate the messages passed by the semantic predicate functions to update the semantic object representation:
where is a learned weight. The spatial representation does not get updated because the spatial location of an object does not move.
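A score-weighted update of the semantic representations could look like the sketch below. This is an assumption-laden illustration rather than the paper's equation: it assumes messages are weighted by the predicate scores, summed over all other nodes and predicates in both directions, divided by a simple normalizing constant, and passed through a sigmoid together with a self-connection. The names `update_semantic`, `forward_fns`, `inverse_fns`, and `w_self` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_semantic(h, scores, forward_fns, inverse_fns, w_self):
    """One assumed update of the semantic features of every node.

    h           : (n, d) semantic features.
    scores      : (P, n, n) array; scores[p, i, j] weights the edge i -p-> j.
    forward_fns : list of P callables mapping a subject feature toward its object.
    inverse_fns : list of P callables mapping an object feature back toward its subject.
    w_self      : (d, d) weight applied to the node's own feature.
    """
    n, d = h.shape
    num_predicates = len(forward_fns)
    norm = max(1, 2 * num_predicates * (n - 1))  # assumed normalizing constant
    new_h = np.empty_like(h)
    for i in range(n):
        message = np.zeros(d)
        for p in range(num_predicates):
            for j in range(n):
                if j == i:
                    continue
                # j acts as subject of j -p-> i: forward function sends a message to i.
                message += scores[p, j, i] * forward_fns[p](h[j])
                # j acts as object of i -p-> j: inverse function sends a message to i.
                message += scores[p, i, j] * inverse_fns[p](h[j])
        new_h[i] = sigmoid(h[i] @ w_self + message / norm)
    return new_h

# Toy usage: two nodes, one predicate, and a single confident edge 0 -> 1.
h = np.array([[0.0, 0.0], [1.0, 1.0]])
scores = np.zeros((1, 2, 2))
scores[0, 0, 1] = 1.0
new_h = update_semantic(
    h, scores,
    forward_fns=[lambda x: x + 1.0],
    inverse_fns=[lambda x: x - 1.0],
    w_self=np.eye(2),
)
```

Because every message is scaled by its predicate score, low-scoring edges of the fully connected graph contribute little to the update.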
Scene graph output. Finally, we predict the categories of each node using , where is an MLP that generates a probability distribution over all the possible object categories. Each possible relationship is output as a relationship only if , where is the total number of iterations in the model and is a threshold hyperparameter.
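The thresholding step that turns final predicate scores into a scene graph can be sketched as follows. The array layout, the function name `extract_relationships`, and the exclusion of self-relationships are assumptions made for illustration.

```python
import numpy as np

def extract_relationships(scores, threshold):
    """Keep every (subject, predicate, object) whose final score exceeds the threshold.

    scores : (n, n, P) array of final-iteration scores; scores[i, j, p] is the
             score of predicate p on the directed edge from node i to node j.
    """
    n = scores.shape[0]
    keep = scores > threshold
    # Assumption: an object is never related to itself.
    keep[np.arange(n), np.arange(n), :] = False
    return [(int(i), int(p), int(j)) for i, j, p in np.argwhere(keep)]

# Toy usage with 3 objects and 2 predicates.
scores = np.zeros((3, 3, 2))
scores[0, 1, 1] = 0.9   # kept
scores[2, 0, 0] = 0.6   # kept
scores[1, 1, 0] = 0.99  # self-edge, dropped
scores[1, 2, 1] = 0.4   # below threshold, dropped
triples = extract_relationships(scores, threshold=0.5)
```

Since each predicate is thresholded independently, a pair of objects may end up with several relationships or with none.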
4 Few-shot predicate framework
With our semantic () and spatial () predicate functions trained for the frequent predicates , we now utilize these functions to create object representations to train few-shot predicates. We design few-shot predicate classifiers to be MLPs with layers with ReLU activations between layers. We assume that rare predicates are and only have examples each.
The intuition behind our -shot training scheme lies in the modularity of predicates and their shared semantic and spatial components. By decomposing the predicate representations from the object in the graph convolutions, we create a representation space that supports predicate transformations. We will show in our experiments that our embedding space places semantically similar objects that participate in similar relationships together. Now, when training with few examples of rare predicates, such as driving, we can rely on the semantic embeddings for objects that were clustered by riding.
We pass all labelled examples of a predicate pair of objects through the learned predicate functions and extract the hidden representations and from the final graph convolution layer. We concatenate these transformations along the channel dimension and feed them as an input to the few-shot classifiers. We train the -shot classifiers by minimizing the cross-entropy loss against the labelled examples amongst rare categories.
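The few-shot training step can be sketched as a small classifier fit on concatenated subject and object representations. This is a minimal sketch under assumptions: a single hidden layer, full-batch gradient descent, and toy inputs stand in for the unspecified number of layers, optimizer, and real hidden representations. The name `train_few_shot_classifier` and all hyperparameter values are hypothetical.

```python
import numpy as np

def train_few_shot_classifier(x, y, num_classes, hidden=16, lr=0.1, steps=500, seed=0):
    """Fit a one-hidden-layer ReLU MLP by minimizing softmax cross-entropy.

    x : (m, d) inputs, each the concatenation of a subject and an object representation.
    y : (m,) integer labels of the rare predicates.
    Returns a function mapping inputs to predicted labels.
    """
    rng = np.random.default_rng(seed)
    m, d = x.shape
    w1 = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, num_classes))
    b2 = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        # Forward pass.
        z1 = x @ w1 + b1
        a1 = np.maximum(z1, 0.0)
        logits = a1 @ w2 + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Backward pass for the mean cross-entropy loss.
        g_logits = (probs - onehot) / m
        g_w2 = a1.T @ g_logits
        g_b2 = g_logits.sum(axis=0)
        g_z1 = (g_logits @ w2.T) * (z1 > 0)
        g_w1 = x.T @ g_z1
        g_b1 = g_z1.sum(axis=0)
        # Gradient descent update.
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2

    def predict(inputs):
        hidden_act = np.maximum(inputs @ w1 + b1, 0.0)
        return np.argmax(hidden_act @ w2 + b2, axis=1)

    return predict

# Toy usage: 5 labelled examples for each of two hypothetical rare predicates.
scales = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
pattern_a = np.array([1.0, 0.0, 0.0, 1.0])  # concat(subject, object) for class 0
pattern_b = np.array([0.0, 1.0, 1.0, 0.0])  # concat(subject, object) for class 1
x = np.concatenate([scales[:, None] * pattern_a, scales[:, None] * pattern_b])
y = np.array([0] * 5 + [1] * 5)
predict = train_few_shot_classifier(x, y, num_classes=2)
train_accuracy = float((predict(x) == y).mean())
```

Only the small classifier is trained here; the representations fed into it are treated as fixed outputs of the pretrained predicate functions.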
| Model | SG GEN | SG GEN | SG CLS | SG CLS | PRED CLS | PRED CLS |
|---|---|---|---|---|---|---|
| IMP (xu2017scene, ) | 06.40 | 08.00 | 20.60 | 22.40 | 40.80 | 45.20 |
| MSDN (li2017scene, ) | 07.00 | 09.10 | 27.60 | 29.90 | 53.20 | 57.90 |
| MotifNet-freq (zellers2017neural, ) | 06.90 | 09.10 | 23.80 | 27.20 | 41.80 | 48.80 |
| Graph R-CNN (yang2018graph, ) | 11.40 | 13.70 | 29.60 | 31.60 | 54.20 | 59.10 |
| Our full model | 13.18 | 13.45 | 23.71 | 24.66 | 56.65 | 57.21 |
| Factorizable Net (li2018factorizable, ) | 13.06 | 16.47 | - | - | - | - |
| KB-GAN (gu2019scene, ) | 13.65 | 17.57 | - | - | - | - |
| MotifNet (zellers2017neural, ) | 27.20 | 30.30 | 35.80 | 36.50 | 65.20 | 67.10 |
| PI-SG (herzig2018mapping, ) | - | - | 36.50 | 38.80 | 65.10 | 66.90 |
| Our spatial only | 02.05 | 02.32 | 03.92 | 04.54 | 04.19 | 04.50 |
| Our semantic only | 12.92 | 12.39 | 23.35 | 24.00 | 56.02 | 56.67 |
| Our full model | 13.18 | 13.45 | 23.71 | 24.66 | 56.65 | 57.21 |
We begin our evaluation by first describing the dataset, evaluation metrics, and baselines. Our first experiment studies our graph convolution framework and compares our scene graph prediction performance against existing state-of-the-art methods. Our second experiment tests the utility of our approach on our main objective of enabling few-shot scene graph prediction. Finally, our third experiment showcases interpretable visualizations by visualizing the predicate transformations.
Dataset: We use the Visual Genome (krishna2017visual, ) dataset for training, validation and testing. To benchmark against existing scene graph approaches, we use the commonly used subset of object and predicate categories (xu2017scene, ; zellers2017neural, ; yang2018graph, ). We use publicly available pre-processed splits of train and test data, and sample a validation set from the training set (zellers2017neural, ). The training, validation, and test sets contain and and images, respectively.
Evaluation metrics: For scene graph prediction, we use three evaluation tasks, all of which are evaluated at recall@ and recall@. (1) PredCls predicts predicate categories, given ground truth bounding boxes and object classes, (2) SGCls predicts predicate and object categories given ground truth bounding boxes, and (3) SGGen detects object locations, categories and predicate categories.
Metrics based on recall require ranking predictions. For PredCls this means a simple ranking of predicted predicates by score. For SGCls this means ranking subject-predicate-object tuples by a product of subject, object, and predicate scores. For SGGen this means a similar product as SGCls, but tuples without correct subject or object localizations are not counted as correct. We refer readers to previous work that defined these metrics for further reading (lu2016visual, ).
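The ranking-based metric can be sketched as follows. The function names and the toy triples are illustrative assumptions; the product of subject, predicate, and object scores follows the SGCls description above.

```python
def triple_score(subject_score, predicate_score, object_score):
    """Rank key for a triple: the product of its three component scores."""
    return subject_score * predicate_score * object_score

def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the k highest-scoring predictions.

    predictions  : list of (score, (subject, predicate, object)) pairs.
    ground_truth : collection of (subject, predicate, object) triples.
    """
    truth = set(ground_truth)
    if not truth:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[0], reverse=True)
    top_k = {triple for _, triple in ranked[:k]}
    return len(top_k & truth) / len(truth)

# Toy usage with two annotated relationships and four ranked predictions.
predictions = [
    (triple_score(1.0, 0.9, 1.0), ("woman", "drinking", "coffee")),
    (triple_score(1.0, 0.8, 1.0), ("coffee", "under", "table")),
    (triple_score(1.0, 0.7, 1.0), ("coffee", "on", "table")),
    (triple_score(1.0, 0.1, 1.0), ("woman", "on", "table")),
]
ground_truth = {("woman", "drinking", "coffee"), ("coffee", "on", "table")}
r1 = recall_at_k(predictions, ground_truth, k=1)
r3 = recall_at_k(predictions, ground_truth, k=3)
```

A model that declines to predict any relationship for some object pairs simply has fewer candidates to fill the top k, which matters more as k grows.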
For few-shot prediction, we report recall@ and recall@ on the task of PredCls. We vary the number of labeled examples available for training few-shot predicate classifiers from . We also report recall@ in addition to the traditional recall@ because each image only has a few instances of rare predicates in the test set.
Baselines: We classify existing methods into two categories. The first category includes other scene graph approaches that, like our approach, only utilize Visual Genome’s data as supervision. This includes Iterative Message Passing (IMP) (xu2017scene, ), Multi-level Scene Description Network (MSDN) (li2017scene, ), ViP-CNN (li2017vip, ), and MotifNet-freq zellers2017neural . The second category includes models such as Factorizable Net (li2018factorizable, ), KB-GAN (gu2019scene, ), and MotifNet zellers2017neural , which use linguistic priors in the form of word vectors or external information from knowledge bases. MotifNet also deploys a custom-trained object detector and class-conditioned non-maximum suppression, and heuristically removes all object pairs that do not overlap. While these models are not directly comparable to ours, we report their numbers for clarity.
5.1 Scene graph prediction
We report scene graph prediction numbers on Visual Genome krishna2017visual in Table 1. This experiment is meant to serve as a benchmark against existing scene graph approaches. We outperform existing models that only use Visual Genome supervision for SGGen and PredCls by and recall@, respectively. But we fall short on recall@. As we move from recall@ to recall@, models are evaluated on their top predictions instead of their top . Unlike other models that perform a multi-class classification of predicates for every object pair, we assign binary scores to each possible predicate between an object pair individually. Therefore, we can report that no relationship exists between a pair of objects. While this design decision allows us to separate learning predicate transformations from learning object representations, it penalizes our model for not guessing relationships for every single object pair, thereby reducing our recall@ scores. We also notice that since our model doesn’t utilize the object categories to make relationship predictions, it performs worse for the task of SGCls, which presents models with ground truth object locations.
We also report ablations of our model trained using only the semantic or spatial functions. We observe that different ablations of the model perform better on certain types of predicates. The spatial model performs well on predicates that have a clear spatial or location-based aspect, such as above and under. The semantic model performs better on non-spatial predicates such as has and holding. Our full model outperforms the individual semantic-only and spatial-only models as predicates can utilize both components. We visualize some scene graphs generated by our network in Figure 6.
5.2 Few-shot prediction
Our second experiment studies how well we perform few-shot scene graph prediction with limited examples per predicate. Our approach requires two sets of predicates: a set of frequently occurring predicates and a second set of rare predicates with only examples. We split the usual predicates typically used in Visual Genome, placing the predicates with the most training examples into the first set and the remaining predicates into the second set. In our experiments, we train the predicate functions and the graph convolution framework using the predicates in the first set. Next, we use them to train -shot classifiers for the rare predicates in the second set by utilizing the representations generated by the pretrained predicate functions. We iterate over .
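The split into frequent and rare predicates, followed by subsampling a fixed number of labelled examples per rare predicate, can be sketched as below. The function names, the cut-off value, and the toy triples are illustrative assumptions rather than the values used in the experiments.

```python
from collections import Counter

def split_predicates(training_triples, num_frequent):
    """Rank predicates by training frequency and split them into two sets."""
    counts = Counter(predicate for _, predicate, _ in training_triples)
    ranked = [predicate for predicate, _ in counts.most_common()]
    return ranked[:num_frequent], ranked[num_frequent:]

def k_shot_subset(training_triples, rare_predicates, k):
    """Keep at most k labelled examples for each rare predicate."""
    rare = set(rare_predicates)
    seen = Counter()
    kept = []
    for triple in training_triples:
        predicate = triple[1]
        if predicate in rare and seen[predicate] < k:
            kept.append(triple)
            seen[predicate] += 1
    return kept

# Toy usage with four predicates of decreasing frequency.
triples = (
    [("cup", "on", "table")] * 5
    + [("man", "has", "hat")] * 4
    + [("boy", "riding", "bike")] * 3
    + [("woman", "driving", "car")] * 2
)
frequent, rare = split_predicates(triples, num_frequent=2)
few_shot_examples = k_shot_subset(triples, rare, k=1)
```

The frequent set would be used to train the predicate functions, and only the subsampled examples would be used to train the classifiers for the rare set.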
For a rigorous comparison, we choose to compare our method against MotifNet zellers2017neural , which outperforms all existing scene graph approaches and uses linguistic priors from word embeddings and heuristic post-processing to generate high-quality scene graphs. Specifically, we report two different training variants of MotifNet: MotifNet-Baseline, which is initialized with random weights and trained only using labelled examples and MotifNet-Transfer, which is first trained on the frequent predicates and then finetuned on the few-shot predicates. We also compare against Ours-Baseline, which trains our graph convolution framework on the few-shot predicates and Ours-Oracle, which reports the upper bound performance when trained with all of Visual Genome.
Results in Figure 4 show that our method performs better than all baseline comparisons for all values of . We find that our learned classifiers are similar in performance to MotifNet-Transfer when . This is likely because MotifNet-Transfer also has access to additional information available from word embeddings. The improvements seen by our approach increase as increases to , where we outperform the baselines by recall@. Eventually, as more labels become available, the Neural Motif model outperforms our model for values of .
5.3 Interpretable predicate transformation visualizations
Our final experiment showcases another utility of treating predicates as functions. Once trained, these functions can be individually visualized and qualitatively evaluated. Figure 5 (left and middle) shows examples of transforming spatial attention from four instances of person, horse, boy, and banana in four images. We see that above and standing on move attention below the person, while looking moves attention left, in the direction the horse is looking. The predicate wearing highlights the center of the boy. Figure 5 (right) shows semantic transformations applied to the embedding representation space of objects. We see that riding transforms the embedding to a space that contains objects like wave, skateboard, bike and horse. Notice that unlike linguistic word embeddings, which are trained to place words found in similar contexts together, our embedding space represents the types of visual relationships that objects participate in. We include more visualizations in our appendix.
We introduced the first scene graph prediction model that treats predicates as functions and generates object representations that can effectively enable few-shot learning. We treat predicates as neural network transformations between object representations. The functions disentangle the object representations from storing predicate information, and instead generate an embedding space in which objects that participate in similar relationships lie close together. Our representations outperform existing methods for few-shot predicate prediction, a valuable task since most predicates occur infrequently. Also, our graph convolution network, which trains the predicate functions, performs on par with existing state-of-the-art scene graph prediction models. Finally, the predicate functions result in interpretable visualizations, allowing us to visualize the spatial and semantic transformations learned for each predicate.
Acknowledgements We thank Iro Armeni, Suraj Nair, Vincent Chen, and Eric Li for their helpful comments. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
- (1) P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
- (2) J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.
- (3) J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
- (4) M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
- (5) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
- (6) V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei. Scene graph prediction with limited labels. arXiv preprint arXiv:1904.11622, 2019.
- (7) A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In Proceedings of the 42nd annual meeting on association for computational linguistics, page 423. Association for Computational Linguistics, 2004.
- (8) B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
- (9) C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. International journal of computer vision, 95(1):1–12, 2011.
- (10) D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
- (11) A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
- (12) L. Fei-Fei et al. A bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 1134–1141. IEEE, 2003.
- (13) L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
- (14) V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
- (15) J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
- (16) R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
- (17) A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
- (18) J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and image reconstruction. arXiv preprint arXiv:1904.00560, 2019.
- (19) Z. GuoDong, S. Jian, Z. Jie, and Z. Min. Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 427–434. Association for Computational Linguistics, 2005.
- (20) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- (21) M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
- (22) R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems, pages 7211–7221, 2018.
- (23) J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. arXiv preprint arXiv:1804.01622, 2018.
- (24) J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. arXiv preprint arXiv:1705.03633, 2017.
- (25) J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3668–3678, 2015.
- (26) T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- (27) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
- (28) P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
- (29) R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In Computer Vision and Pattern Recognition, 2018.
- (30) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
- (31) A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387, 2016.
- (32) B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
- (33) Y. Li, W. Ouyang, X. Wang, and X. Tang. Vip-cnn: Visual phrase guided convolutional neural network. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 7244–7253. IEEE, 2017.
- (34) Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang. Factorizable net: an efficient subgraph-based framework for scene graph generation. In European Conference on Computer Vision, pages 346–363. Springer, 2018.
- (35) Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1261–1270, 2017.
- (36) Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- (37) X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4408–4417. IEEE, 2017.
- (38) W. Liao, L. Shuai, B. Rosenhahn, and M. Y. Yang. Natural language guided visual relationship detection. arXiv preprint arXiv:1711.06032, 2017.
- (39) C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
- (40) A. Mehrotra and A. Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017.
- (41) A. Newell and J. Deng. Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems, pages 2168–2177, 2017.
- (42) M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
- (43) D. Parikh and K. Grauman. Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 503–510. IEEE, 2011.
- (44) B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
- (45) S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- (46) F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
- (47) M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
- (48) S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pages 70–80, 2015.
- (49) J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
- (50) E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2255–2265, 2017.
- (51) Z. Tu and X. Bai. Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1744–1757, 2010.
- (52) O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
- (53) J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
- (54) C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
- (55) D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
- (56) J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. arXiv preprint arXiv:1808.00191, 2018.
- (57) R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. arXiv preprint arXiv:1707.09423, 2017.
- (58) R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. arXiv preprint arXiv:1711.06640, 2017.
- (59) H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, volume 1, page 5, 2017.
- (60) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
- (61) G. Zhou, M. Zhang, D. Ji, and Q. Zhu. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
We include additional scene graph outputs from our graph convolution model, visualizations of the spatial and semantic transformations, and a visualization of the object feature space.
7.1 More scene graph model outputs
Figure 6 shows more examples of scene graphs generated by our model. For the middle image, which shows a woman riding a motorcycle, the scene graph indicates that our model identifies the main action taking place in the image. It also correctly identifies parts of the motorcycle, such as seat, tire, and light. The scene graph for the bottom-right image shows that our model can identify parts of the woman’s body, such as nose and leg, and can predict her actions: carrying the bag and holding the umbrella.
7.2 Semantic transformations
Table 2 shows more examples of semantic transformations applied to the embedding feature space of objects. For example, child transformed by walking on resembles objects that we walk on: street, sidewalk, and snow. The model also learns more specific and rare relationships such as attached to: sign transformed by attached to most closely resembles objects such as pole and fence.
| Subject | Predicate | Closest objects after transformation |
|---|---|---|
| girl | riding | wave, skateboard, bike, horse |
| man | wears | shirt, jacket, hat, cap, helmet |
| person | has | hair, head, face, arm, ear |
| dog | laying on | bed, beach, bench, desk, table |
| child | walking on | street, sidewalk, snow, beach |
| boy | sitting on | bench, bed, desk, chair, toilet |
| umbrella | covering | kid, people, skier, person, guy |
| tail | belonging to | cat, elephant, giraffe, dog |
| stand | over | street, sidewalk, beach, hill |
| mountain | and | hill, mountain, skier, snow |
| motorcycle | parked on | street, sidewalk, snow, beach |
| sign | attached to | pole, fence, shelf, post, building |
| sidewalk | in front of | building, room, house, fence |
| kid | watching | giraffe, zebra, plane, horse |
| men | looking at | airplane, plane, bus, laptop |
| child | standing on | sidewalk, beach, snow, track |
| guy | holding | racket, umbrella, glass, bag |
| motorcycle | has | heel, wing, handle, tire, engine |
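The lookup behind a table like this can be sketched as a nearest-neighbor search: transform the subject embedding with a predicate's semantic function, then rank object categories by similarity to the result. The sketch below is illustrative only. The two-dimensional embeddings, the `walking_on` translation, and the choice of cosine similarity are all assumptions made for the example, not the learned functions or the distance measure used in our model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_categories(transformed, category_embeddings, k=3):
    """Return the k category names whose embeddings are most similar to `transformed`."""
    ranked = sorted(
        category_embeddings.items(),
        key=lambda item: cosine(transformed, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

def walking_on(subject_embedding):
    """Hypothetical stand-in for a learned semantic function: a fixed translation."""
    offset = [-1.0, 1.0]  # toy value chosen for illustration, not learned
    return [s + o for s, o in zip(subject_embedding, offset)]

# Toy 2-D category embeddings (assumed values).
toy_embeddings = {
    "street": [0.0, 1.0],
    "sidewalk": [0.1, 0.9],
    "giraffe": [1.0, 0.0],
}
child = [1.0, 0.1]
closest = nearest_categories(walking_on(child), toy_embeddings, k=2)
```

With these toy values, `closest` lists street before sidewalk, mirroring the child / walking on row.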
7.3 Inverse predicate functions
To understand the effect of including inverse predicate functions, we performed an ablation study where the inverse predicate functions were omitted from the model. We found that the semantic-only model trained without inverse functions performed worse on recall@ than the semantic model with inverse functions.
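For readers unfamiliar with the metric, a recall-at-K computation for a single image can be sketched as follows. This is a minimal illustration under assumed data structures (scored triples as tuples, ground truth as a set), and the cutoff `k` is left as a parameter; it is not our evaluation code.

```python
def recall_at_k(scored_predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the k highest-scoring predictions.

    scored_predictions: list of (score, (subject, predicate, object)) tuples.
    ground_truth: set of (subject, predicate, object) triples.
    """
    if not ground_truth:
        return 0.0
    top_k = sorted(scored_predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triple for _, triple in top_k if triple in ground_truth}
    return len(hits) / len(ground_truth)

# Toy predictions and annotations (assumed values).
predictions = [
    (0.9, ("woman", "riding", "motorcycle")),
    (0.8, ("woman", "holding", "umbrella")),
    (0.3, ("motorcycle", "has", "tire")),
]
annotations = {
    ("woman", "riding", "motorcycle"),
    ("motorcycle", "has", "tire"),
}
recall_top2 = recall_at_k(predictions, annotations, k=2)
recall_top3 = recall_at_k(predictions, annotations, k=3)
```

In this toy case only one of the two annotated relationships falls in the top two predictions, while both fall in the top three.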
We also visualize how these inverse functions behave compared with the forward functions, as shown in Figure 7. The spatial function for the predicate riding shifts attention below the person in the image, which is the expected result because the skateboard is below the person. The inverse transformation of riding shifts the skateboard mask slightly above the skateboard, which is likewise expected because skateboarders are typically above their boards.
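The forward and inverse behavior described here can be illustrated with a toy example in which the spatial function is replaced by a fixed one-cell shift of an attention mask. This is a simplification for intuition only: the fixed shift, the 3x3 grid, and the function names `riding_forward` and `riding_inverse` are assumptions for the example, whereas the spatial functions in our model are learned.

```python
def shift_mask(mask, dy, dx):
    """Shift a 2-D attention mask by (dy, dx) cells, filling vacated cells with 0."""
    height, width = len(mask), len(mask[0])
    shifted = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = mask[y][x]
    return shifted

def riding_forward(subject_mask):
    """Toy forward function: move attention one row down, toward the ridden object."""
    return shift_mask(subject_mask, 1, 0)

def riding_inverse(object_mask):
    """Toy inverse function: move attention one row up, back toward the rider."""
    return shift_mask(object_mask, -1, 0)

# Attention on a person in the top row of a 3x3 grid (row 0 is the top of the image).
person_mask = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
skateboard_mask = riding_forward(person_mask)
recovered_person_mask = riding_inverse(skateboard_mask)
```

Here the forward shift places attention directly below the person, and the inverse shift returns it to the person's location.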
7.4 Visualizing object representations
In the process of training our predicate functions, we learn representations for each object instance we encounter. From the embedding of each object instance, we calculate the average object category embedding. Each of the distinct object categories is embedded into a learned -dimension space. Figure 8 shows a t-SNE visualization of these embeddings. We observe object categories that participate in similar relationships grouped together. For example, embeddings for bird, cow, bear, and other animals are close together (inside the red rectangle).
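The averaging step that produces one embedding per category can be sketched as below. The input format (pairs of category name and instance embedding) and the two-dimensional toy vectors are assumptions made for the example.

```python
from collections import defaultdict

def average_category_embeddings(instances):
    """Average instance embeddings per object category.

    instances: iterable of (category, embedding) pairs, where each embedding is
    a list of floats of the same length. Returns {category: mean embedding}.
    """
    sums = {}
    counts = defaultdict(int)
    for category, embedding in instances:
        if category not in sums:
            sums[category] = [0.0] * len(embedding)
        sums[category] = [s + e for s, e in zip(sums[category], embedding)]
        counts[category] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}

# Toy instance embeddings (assumed values).
toy_instances = [
    ("bird", [1.0, 0.0]),
    ("bird", [3.0, 2.0]),
    ("cow", [0.0, 4.0]),
]
category_means = average_category_embeddings(toy_instances)
```

The resulting per-category vectors are what a dimensionality-reduction method such as t-SNE would then project to two dimensions for plotting.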