1 Introduction

Humans are able to do common-sense reasoning across a variety of modalities (textual, visual) and for a variety of tasks (reasoning, locating, navigation). Several such tasks require spatial knowledge understanding and reasoning. Although there have been several recent works on common-sense reasoning, progress on spatial understanding and common sense is rather limited. Prior work has either not been spatially focused or has been very restrictive in the class of spatial relations it handles. Recent work presented the task of predicting an object's location and size in an image given the subject's bounding box and the spatial relation between them.
In this paper, we address the problem of understanding spatial relations. Specifically, we want to infer the spatial relationship between two entities given an image involving them. Spatial relations can be either explicit (spatial prepositions such as on, above, under) or implicit (spatial concepts intrinsic to actions such as sleeping, sitting, flying). As input we take an image, the bounding boxes of the two spatially related entities, and the word and image embeddings of the two entities, and we predict the spatial relation between them (e.g., sleeping, standing, sitting-on in Figure 1). We compare this with powerful language models like BERT, which have been trained on very large text corpora in a variety of contexts. Although BERT is not conditioned on the image, it can still provide a good list of candidate relations between the two entities using only the structured text information, and it can propose new relations unseen in the specific dataset we train on. We show that these complementary strengths (the task-specific model's ability to use image-specific information, and BERT's ability to predict a wide range of relations) can together give better performance than either approach alone, especially in low-resource and generalized settings. For the (woman, ?, bed) example, if we were to rely on a language model alone, the prediction would be laying on or sleeping on almost all the time, because using just the textual modality makes the model blind to image-specific cues. Using both the textual and visual information leads to a more robust model conditioned on the image as well as on the subject and object text descriptions.
Our contributions are four-fold: (1) A new task definition of explicit and implicit spatial relation prediction for two entities in an image, exploring another dimension of common-sense understanding of spatial relations with visual and language information. (2) Spatial BERT: a combination of BERT and a spatial model. We conduct thorough experiments to show the roles of the image, position and language information for the task under different settings: varying the number of training examples, the type of the spatial relations and the type of BERT (code: https://github.com/sdan2/Multimodal-Spatial). (3) As a byproduct, a re-scoring technique for evaluating this model combination. (4) We show that Spatial BERT is able to predict in the generalized setting, i.e., for unseen subjects, objects or relations.
2 Model Details
2.1 The Basics
The task is to predict the spatial relation r given the subject s and the object o. For the subject s, we have the corresponding text information t_s, position information p_s, and image information v_s. Similarly, we have the corresponding text information t_o, position information p_o, and image information v_o for the object o.
2.2 The Spatial Model
Feed Forward Network (FF). For the text information, we represent multi-word subjects and objects by the average of the GloVe embeddings of their words; for simplicity, we write g_s and g_o for the averaged GloVe embeddings of the subject and the object. The position p_s (p_o) contains four float values: the x and y coordinates of the center of the subject (object) and the half-width and half-height of its bounding box. We pass the concatenation of g_s, g_o, p_s and p_o through a feed-forward network and take a softmax over all possible relations.
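As a concrete sketch of the FF model, the snippet below builds the four position values from a bounding box and runs the concatenated text + position features through a single hidden layer and a softmax. The dimensions, ReLU activation and weight names here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def bbox_features(x1, y1, x2, y2):
    """Four position values: the box center (x, y) plus the
    half-width and half-height, as described for p_s and p_o."""
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2,
                     (x2 - x1) / 2, (y2 - y1) / 2])

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ff_predict(glove_s, glove_o, pos_s, pos_o, W1, b1, W2, b2):
    """One hidden layer over the concatenated text + position features,
    followed by a softmax over all candidate relations."""
    x = np.concatenate([glove_s, glove_o, pos_s, pos_o])
    h = np.maximum(0.0, W1 @ x + b1)     # ReLU hidden layer (illustrative)
    return softmax(W2 @ h + b2)
```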
Feed Forward Network + Image Embeddings (FF+I). In addition to the text and position information of the subject and object, we use pre-trained visual embeddings to represent the image information; for simplicity, we write v_s and v_o for the visual embeddings of the subject and the object. The visual embeddings are obtained from a VGG128 network pre-trained on ImageNet and fine-tuned on Visual Genome. The hidden representation in FF+I is computed as in FF, with v_s and v_o concatenated to the input.
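The FF+I input can be sketched as the same feature vector with the two visual embeddings appended; the function name below is ours, not the paper's.

```python
import numpy as np

def ffi_features(glove_s, glove_o, pos_s, pos_o, vis_s, vis_o):
    """FF+I input: the FF features (text + position) with the
    pre-trained visual embeddings of the subject and object
    concatenated; the feed-forward classifier on top is unchanged."""
    return np.concatenate([glove_s, glove_o, pos_s, pos_o, vis_s, vis_o])
```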
2.3 The Language Model
We use BERT as the language model to predict the most likely spatial relation by masking the relation, providing "subject [MASK] object" as input, and running beam search to obtain the top predictions.
Fine-tuned BERT (f-BERT). We fine-tune BERT for the Visual Genome dataset by collecting all the "subject relation object" texts from the training data.
2.4 Spatial BERT
We combine the predictions of our best model (FF+I) with normal BERT and with fine-tuned BERT to obtain two combined models. We essentially re-rank the predictions of the two models: the combined score of a relation is score_spatial + λ · score_BERT, where λ is a non-negative float that adjusts the weight of the BERT predictions.
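The re-ranking can be sketched as follows, assuming (as a simplification) that each model exposes its predictions as a dictionary from relation to score:

```python
def combine_scores(spatial_scores, bert_scores, lam):
    """Re-rank relations by adding lambda-weighted BERT scores to the
    spatial model's scores; a relation missing from one model simply
    contributes zero there. Returns relations, best first."""
    relations = set(spatial_scores) | set(bert_scores)
    combined = {r: spatial_scores.get(r, 0.0) + lam * bert_scores.get(r, 0.0)
                for r in relations}
    return sorted(combined, key=combined.get, reverse=True)
```

Setting lam = 0 recovers the spatial model's ranking, while a very large lam recovers BERT's, which is why those extremes are excluded when sweeping lam.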
3 Experiments

3.1 Dataset

We use the Visual Genome dataset. We work on the (subject, relation, object) triples where the subject and object are accompanied by bounding-box information (each image is scaled to a fixed size so that the box coordinates are comparable; see prior work for more details on the data pre-processing step). The dataset is partitioned into two categories, explicit and implicit, based on the spatial relation in a triple, and the experiments are performed separately for each category. Figures 2 and 3 show examples of triples and their relations from the dataset.
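The bounding-box rescaling mentioned above can be sketched as below; the target canvas size is a placeholder assumption, since the actual value is not stated here.

```python
def scale_bbox(box, img_w, img_h, target=1024):
    """Rescale a pixel-space box (x1, y1, x2, y2) from an img_w x img_h
    image onto a fixed target x target canvas, so that position features
    from differently sized images are comparable. The target size is
    illustrative, not the paper's actual value."""
    sx, sy = target / img_w, target / img_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```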
For the explicit relations, on is the most frequent relation, and its frequency determines the majority baseline for explicit relation prediction. Similarly, for the implicit relations, has is the most frequent relation, and its frequency determines the majority baseline for implicit relation prediction.
3.2 GloVe-based Re-Scoring Metric
One of the principal benefits of using language models is that they can predict new relations never seen during the training of the spatial model (see Figure 4). In several cases (especially in the low-training-data regimes) where BERT is used in combination with the spatial model, BERT may predict spatial relations that are unseen in training, and we need an effective strategy to re-score the spatial model predictions based on these unseen predictions. For example, say BERT predicts atop (not seen during training) as the relation between book and table for an image whose gold triple is (book, on, table). For such situations we develop a re-scoring metric that distributes the score BERT assigns to unseen relations (atop) among the seen relations (on, above, over).
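A minimal sketch of such a re-scoring metric, assuming GloVe vectors are available as a dictionary and using cosine similarity to decide how an unseen relation's score is shared among the seen relations (the exact weighting used in the paper may differ):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rescore(seen_scores, unseen_scores, glove):
    """Distribute each unseen relation's score over the seen relations,
    weighted by GloVe cosine similarity, so that an unseen prediction
    like 'atop' boosts near-synonyms such as 'on'."""
    out = dict(seen_scores)
    for u_rel, u_score in unseen_scores.items():
        sims = {r: max(cosine(glove[u_rel], glove[r]), 0.0)
                for r in seen_scores}
        total = sum(sims.values())
        if total > 0:
            for r in seen_scores:
                out[r] += u_score * sims[r] / total
    return out
```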
3.3 Experimental Settings
We present the experimental results separately for the explicit and implicit relations. For each type of relation, we vary the percentage of data used for training over 1%, 10%, 50%, 75% and 100%. We try two variations of BERT: the normal pre-trained BERT, and f-BERT, i.e., BERT fine-tuned on the implicit and explicit datasets respectively. We try a variety of λ values to combine the BERT scores and the spatial-model scores, and we report the best result across the different values of λ (we exclude λ = 0 and very large λ, which correspond to the spatial-model-only and BERT-only results). In all the settings the train-development-test split is held fixed.
Table 1: results for the explicit and implicit relation prediction tasks at 1%, 10%, 50%, 75% and 100% of the training data.

Table 2: results in the unseen subject (S, R), unseen object (O, R) and unseen relation (R) settings for the explicit (Expl) and implicit (Impl) relations.
We see that the best performance in several of the settings is achieved by Spatial BERT. Although the gains may look small, the number of data points is large, so the improvement is significant in terms of absolute counts. For smaller percentages of training data the spatial models perform poorly, but they eventually overtake BERT. Also notice that BERT performs significantly better on the explicit spatial relations than on the implicit ones, possibly because the set of implicit relations is much larger and predicting them requires more image understanding and common sense than the explicit spatial relations do. The BERT and f-BERT numbers do not change across data percentages because both are used only for inference, so the amount of training data does not affect them.
4 Unseen Subject, Object or Relation
Generalizing to unseen contexts is greatly desirable and is necessary for true spatial understanding of the relations.
4.1 The Settings
If a spatial relation was seen during training, e.g., (man, riding, horse) with its supporting image, the model should be able to infer that the (in this example, implicit) spatial relation for a new image depicting (lady, riding, elephant) is riding, even if it has never seen the subject (lady) or object (elephant) before. We show two illustrative examples from Visual Genome that we want to handle for the unseen subject (Figure 5) and unseen object (Figure 6) settings respectively. The pre-trained embeddings of an unseen subject (object) are similar to the embeddings of similar seen subjects (objects), and this, along with the positional information, should help the model identify that the relations are similar. To systematically test this capability, we perform experiments for the implicit and explicit relations in Table 2. For each dataset type we experiment with three settings: unseen subject, unseen object and unseen relation.
The unseen subject (object) setting is relatively easy compared to the unseen relation setting. For subjects, we test if we can correctly predict the flying relation in (kid, flying, kite) even if we have never seen (kid, flying, *) during training. For objects, we test if we can correctly predict the riding relation in (man, riding, elephant) even if we have never seen (*, riding, elephant) during training. Here, * denotes any object or subject, respectively. We first tabulate all the (subject, relation) and (object, relation) pairs and split this list into the test-set pairs and the training- and development-set pairs. We then form the test (train, development) set by collecting all data points whose (subject, relation) or (object, relation) pair is in the test (respectively train or development) set pairs. Thus, the test set contains (subject, relation) and (object, relation) pairs that are never seen during training. We only use the normal pre-trained BERT for the generalized experiments, since fine-tuning is not very natural for these settings.
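The pair-based split described above can be sketched as follows for the unseen subject setting (the development set is folded into the training side for brevity, and the test fraction is an illustrative assumption):

```python
import random

def unseen_subject_split(triples, test_frac=0.2, seed=0):
    """Split (subject, relation, object) triples so that every
    (subject, relation) pair in the test set never appears in training:
    tabulate the pairs, split the pair list, then route each triple
    according to which side its pair landed on."""
    pairs = sorted({(s, r) for s, r, o in triples})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_frac))
    test_pairs = set(pairs[:n_test])
    test = [t for t in triples if (t[0], t[1]) in test_pairs]
    train = [t for t in triples if (t[0], t[1]) not in test_pairs]
    return train, test
```

The unseen object setting is symmetric, keying on (relation, object) pairs instead.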
Unseen Subject. For the explicit relations, we see that Spatial BERT performs much better than either model in isolation, potentially because the explicit relations form a smaller set that is easier for BERT to predict, which in turn helps Spatial BERT. However, since the class of implicit relations is much larger and subtler, BERT performs very poorly and the spatial model by itself performs best for the implicit relations.
Unseen Object. As shown in Table 2, for both the explicit and the implicit relations Spatial BERT gives the best performance although BERT by itself does not perform very well.
Unseen Relation. This task is possible because BERT can predict a much larger class of relations, and using our GloVe-based re-scoring metric we can decompose the scores of these relations over the known set of relations, which should bring the correct relations towards the top of the ranked list. We also relax the accuracy metric by counting a prediction as correct if the gold relation is among the top-k predicted relations. In this setting the spatial model by itself cannot predict any unseen relation, and thus BERT gives the best performance.
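The relaxed metric can be sketched as:

```python
def top_k_accuracy(ranked_predictions, gold_relations, k=5):
    """Count a prediction as correct if the gold relation appears
    anywhere in the model's top-k ranked relations; k is illustrative."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_relations))
    return hits / len(gold_relations)
```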
5 Conclusion

We presented the task of spatial relation prediction and developed models that use position information, visual embeddings and word embeddings to predict relations. Further, we showed that combining this model with BERT helps in low-resource settings and in generalizing over unseen subjects, objects and relations. However, the visual embeddings for entities used in our models are of limited use in some situations: for example, it is hard for our models to distinguish between (man, running, dog) and (man, walking, dog) given (man, ?, dog). In the future we want to use more principled ways of incorporating image-specific information to achieve an even more fine-grained spatial relation classification. We also want to develop an interactive system that does both the object position prediction task and the spatial relation prediction task for a spatially-involved domain such as Blocks World.
Acknowledgments

This work was supported by Contract W911NF-15-1-0461 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) and by Contract FA8750-19-2-0201 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
References

- VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
- (2018) Points, paths, and playscapes: large-scale spatial language understanding tasks set in the real world. In Proceedings of the First International Workshop on Spatial Language Understanding, pp. 46–52.
- (2016) Towards a dataset for human computer communication via grounded language acquisition. In AAAI Workshop: Symbiotic Cognitive Systems.
- (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
- (2018) Learning representations specialized in spatial knowledge: leveraging language and vision. Transactions of the Association for Computational Linguistics 6, pp. 133–144.
- Acquiring common sense spatial knowledge through implicit spatial templates. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2018) A knowledge hunting framework for common sense reasoning. arXiv preprint arXiv:1810.01375.
- CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910.
- Spatial language understanding with multimodal graphs using declarative learning based programming. In Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing, pp. 33–43.
- (2010) Spatial role labeling: task definition and annotation scheme. In LREC.
- (2011) Spatial role labeling: towards extraction of spatial relations from natural language. ACM Transactions on Speech and Language Processing (TSLP) 8 (3), pp. 4.
- (2016) Visual Genome: connecting language and vision using crowdsourced dense image annotations.
- (2016) Leveraging visual question answering for image-caption ranking. In European Conference on Computer Vision, pp. 261–277.
- (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2018) ATOMIC: an atlas of machine commonsense for if-then reasoning. arXiv preprint arXiv:1811.00146.
- (2012) Representing general relational knowledge in ConceptNet 5. In LREC, pp. 3679–3686.
- (2019) Commonsense reasoning for natural language understanding: a survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172.
- (2017) Naturalizing a programming language via interactive learning. arXiv preprint arXiv:1704.06956.
- (2017) Commonsense LocatedNear relation extraction. arXiv preprint arXiv:1711.04204.
- (2015) Show, attend and tell: neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.