Learning to Relate from Captions and Bounding Boxes

12/01/2019 ∙ by Sarthak Garg, et al. ∙ Carnegie Mellon University

In this work, we propose a novel approach that predicts the relationships between various entities in an image in a weakly supervised manner by relying on image captions and object bounding box annotations as the sole source of supervision. Our proposed approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset by achieving a recall@50 of 15% and a recall@100 of 25%. We also show that the model successfully predicts relations that are not present in the corresponding captions.




1 Introduction

Scene graphs serve as a convenient representation to capture the entities in an image and the relationships between them, and are useful in a variety of settings (for example, Johnson et al. (2015); Anderson et al. (2016); Liu et al. (2017)). While the last few years have seen considerable progress in classifying the contents of an image and segmenting the entities of interest without much supervision (He et al., 2017), the task of identifying and understanding the way in which entities in an image interact with each other without much supervision remains little explored.

Recognizing relationships between entities is non-trivial because the space of possible relationships is immense, and because the number of candidate object pairs grows quadratically with the number of objects present in an image. On the other hand, while image captions are easier to obtain, they are often not completely descriptive of an image (Krishna et al., 2017). Thus, simply parsing a caption to extract relationships is unlikely to capture the rich content and detailed spatial relationships present in an image.

Since different images have different objects and captions, we believe it is possible to recover the information that is missing from the caption of one image from the captions of other, similar images that contain the same objects. In this work, we thus aim to learn the relationships between entities in an image by utilizing only image captions and object locations as the source of supervision. Given that generating a good caption for an image requires one to understand the various entities and the relationships between them, we hypothesize that an image caption can serve as an effective weak supervisory signal for relationship prediction.

Figure 1: C-GEARD Architecture (left) and its integration with the relation classifier (right). C-GEARD acts as the Grounding Module (GM) in our relation classifier.

2 Related work

The task of Visual Relationship Detection has been the main focus of several recent works (Lu et al., 2016; Li et al., 2017a; Zhang et al., 2017a; Dai et al., 2017; Hu et al., 2017; Liang et al., 2017; Yin et al., 2018). The goal is to detect a generic (subject, predicate, object) triplet present in an image. Various techniques have been proposed to solve this task, such as language priors (Lu et al., 2016; Yatskar et al., 2016), deep network models (Zhang et al., 2017a; Dai et al., 2017; Zhu and Jiang, 2018; Yin et al., 2018), referring expressions (Hu et al., 2017; Cirik et al., 2018) and reinforcement learning (Liang et al., 2017). Recent work has also studied the closely related problem of Scene Graph Generation (Li et al., 2017b; Newell and Deng, 2017; Xu et al., 2017; Yang et al., 2017, 2018). The major limitation of the aforementioned techniques is that they are supervised, and require the presence of ground truth scene graphs or relation annotations. Obtaining these annotations can be an extremely tedious and time-consuming process that often needs to be done manually. Our model, in contrast, performs the same task through weak supervision, which makes annotation significantly easier.

Most similar to our current task is work in the domain of Weakly Supervised Relationship Detection. Peyre et al. (2017) use weak supervision to learn the visual relations between the pairs of objects in an image using a weakly supervised discriminative clustering objective function (Bach and Harchaoui, 2008), while Zhang et al. (2017b) use a region-based fully convolutional neural network to perform the same task. They both use (subject, predicate, object) annotations without any explicit grounding in the images as the source of weak supervision, but require these annotations in the form of image-level triplets. Our task, however, is more challenging, because free-form captions can potentially be both extremely unstructured and significantly less informative than annotated structured relations.

3 Proposed Approach

Our proposed approach consists of three sequential modules: a feature extraction module, a grounding module and a relation classifier module. Given the alignments found by the grounding module, we train the relation classifier module, which takes in a pair of object features and classifies the relation between them.

3.1 Feature Extraction

Given an image with a set of objects and their ground truth bounding boxes, the feature extraction module extracts a feature representation for each object. To avoid using ground truth instance-level class annotations that would be required to train an object detector, we use a ResNet-152 network pre-trained on ImageNet as our feature extractor. For every object, we crop and resize the portion of the image corresponding to its bounding box and feed it to the ResNet model to get its feature representation. This representation is a dense vector capturing the semantic information of the object. Note that we do not fine-tune the ResNet architecture.

3.2 Grounding Caption Words to Object Features

Given an image, its caption (a sequence of words) and the object feature representations obtained above, the grounding module aligns the entities and relations found in the caption with the objects’ features and with the features corresponding to pairs of objects in the image. It thus aims to find the subset of words in the caption corresponding to entities, and to ground each such word with its best matching object feature. It also aims to find the subset of relational words, and to ground each relation to a pair of object features which correspond to the subject and object of that relation.

To identify and ground the relations between entities in an image, we propose C-GEARD (Captioning-Grounding via Entity Attention for Relation Detection). C-GEARD passes the caption through the Stanford Scene Graph Parser (Schuster et al., 2015) to get a set of triplets. Each triplet corresponds to one relation present in the caption and consists of a subject, a predicate and an object. The entity subset is then constructed from the subjects and objects of all triplets, and the relation subset from their predicates.
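A minimal sketch of this construction is given below. It assumes the parser output is already available as a list of (subject, predicate, object) string tuples, that entity words are the subjects and objects, and that relation words are the predicates. The function name `build_word_subsets` and the sample triplets are hypothetical and chosen purely for illustration.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def build_word_subsets(triplets: List[Triplet]) -> Tuple[Set[str], Set[str]]:
    """Collect entity words (subjects and objects) and relation words (predicates)."""
    entities: Set[str] = set()
    relations: Set[str] = set()
    for subject, predicate, obj in triplets:
        entities.update((subject, obj))
        relations.add(predicate)
    return entities, relations


# Hypothetical parser output for a caption such as "a man riding a horse in a field".
triplets = [("man", "riding", "horse"), ("horse", "in", "field")]
entities, relations = build_word_subsets(triplets)
```

Using sets means that an entity appearing in several triplets (here, "horse") is grounded only once.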

Captioning using visual attention has proven to be very successful in aligning the words in a caption to their corresponding visual features, such as in Anderson et al. (2018). As shown in Figure 1, we adopt the two-layer LSTM architecture in Anderson et al. (2018); our end goal, however, is to associate each word with the closest object feature rather than producing a caption.

The lower Attention LSTM cell takes in the words and the global image context vector (the mean of all object features), and its hidden state acts as a query vector. This query vector is used to attend over the object features (serving as both key and value vectors) to produce an attention vector which summarizes the key visual information needed for predicting the next word. The Attention module is parameterized as in Bahdanau et al. (2014). The concatenation of the query vector and the attention vector is passed as an input to the upper LM-LSTM cell, which predicts the next word of the caption.
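The attention step can be illustrated with a small, dependency-free sketch of additive attention in the style of Bahdanau et al. (2014). The projection matrices `w_q` and `w_k`, the scoring vector `v` and the toy inputs are all hypothetical; in the actual model these parameters are learned and the query is the Attention LSTM hidden state.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(
    query: Vector, features: List[Vector], w_q: Matrix, w_k: Matrix, v: Vector
) -> Tuple[Vector, Vector]:
    """Return (attention weights, attended vector) over the object features.

    score_j = v . tanh(W_q q + W_k f_j); weights = softmax(scores).
    """
    projected_query = matvec(w_q, query)
    scores = []
    for feat in features:
        projected_key = matvec(w_k, feat)
        hidden = [math.tanh(a + b) for a, b in zip(projected_query, projected_key)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(features[0])
    # The object features act as values: the output is their weighted average.
    attended = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, attended


# Toy example: a 2-d query and three 2-d object features.
query = [1.0, 0.0]
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, attended = additive_attention(query, features, identity, identity, [1.0, 1.0])
```

The weights form a probability distribution over the objects, which is the quantity the grounding step later reads off.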

The model is trained by minimizing the standard negative log-likelihood loss.


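The attention module assigns a probability to each object feature whenever the previous word is fed into the LSTM. C-GEARD uses these attention probabilities to construct the alignments of the entity and relation words: each entity word is grounded to its best matching object feature, and each relation word is grounded to the pair of object features aligned with its subject and object.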
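The alignment step can be sketched as follows, under the assumption that "best matching" means the object feature with the highest attention probability. The input format (a dictionary from each entity word to its attention distribution over object indices) and both function names are hypothetical, and entity words are assumed to be unique within a caption.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def ground_entities(attention: Dict[str, List[float]]) -> Dict[str, int]:
    """Map each entity word to the index of its highest-attention object feature."""
    return {
        word: max(range(len(probs)), key=probs.__getitem__)
        for word, probs in attention.items()
    }


def ground_relations(
    triplets: List[Triplet], entity_alignment: Dict[str, int]
) -> List[Tuple[int, str, int]]:
    """Ground each predicate to the (subject index, object index) pair."""
    grounded = []
    for subject, predicate, obj in triplets:
        # Skip triplets whose subject or object could not be grounded.
        if subject in entity_alignment and obj in entity_alignment:
            grounded.append(
                (entity_alignment[subject], predicate, entity_alignment[obj])
            )
    return grounded


# Hypothetical attention distributions over three objects in the image.
attention = {"man": [0.7, 0.2, 0.1], "horse": [0.1, 0.8, 0.1]}
alignment = ground_entities(attention)
grounded = ground_relations([("man", "riding", "horse")], alignment)
```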

3.3 Relation Classifier

We run the grounding module C-GEARD over the training captions to generate a “grounded” relationship dataset consisting of tuples, each holding two object features and the predicate aligned to them. These predicates occur in free form; however, the relations in the test set are restricted to only the top 50 relation classes. We manually annotate the correspondence between the most frequent parsed predicates and their closest relation class. For example, we map the parsed predicates dress in, sitting in and inside to the canonical relation class in. Using this mapping, we get tuples in which each free-form predicate is replaced by its corresponding canonical relation class.

Since this dataset is generated by applying the grounding module on the set of all images and the corresponding captions, it pools the relation information from across the whole dataset, which we then use to train our relation classifier.
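The predicate canonicalization described above amounts to a lookup table. In the sketch below, only the three mapping entries come from the text; the function name, the tuple layout and the choice to drop predicates without a mapping are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Only these three entries are taken from the text; the full mapping is
# annotated manually for the most frequent parsed predicates.
PREDICATE_TO_CLASS: Dict[str, str] = {
    "dress in": "in",
    "sitting in": "in",
    "inside": "in",
}

Grounded = Tuple[int, str, int]  # (subject feature index, predicate, object feature index)


def canonicalize(
    grounded: List[Grounded], mapping: Dict[str, str] = PREDICATE_TO_CLASS
) -> List[Grounded]:
    """Replace free-form predicates by canonical classes.

    Predicates without a mapping are dropped (an assumption of this sketch).
    """
    canonical = []
    for subj_idx, predicate, obj_idx in grounded:
        relation_class = mapping.get(predicate.lower().strip())
        if relation_class is not None:
            canonical.append((subj_idx, relation_class, obj_idx))
    return canonical


samples = canonicalize([(0, "Sitting in", 1), (1, "juggling", 2)])
```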

We parameterize the relation classifier with a multi-layer perceptron (MLP). Given the feature vectors of any two objects, the relation classifier is trained to classify the relation between them.
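A forward pass of such a classifier can be sketched in plain Python. The two hidden layers of 64 units follow the hyperparameters reported in the appendix; concatenating the two feature vectors, the ReLU activations, the random initialization and the toy feature dimension are assumptions of this sketch, and dropout and training are omitted.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    peak = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify_relation(layers: List[Layer], subject_feat: Vector, object_feat: Vector) -> Vector:
    """Concatenate the pair, apply the hidden layers, and softmax over classes."""
    x = list(subject_feat) + list(object_feat)
    for weights, bias in layers[:-1]:
        x = relu(linear(weights, bias, x))
    weights, bias = layers[-1]
    return softmax(linear(weights, bias, x))


rng = random.Random(0)
feat_dim, hidden, num_classes = 8, 64, 50  # feat_dim is a toy value
layers = [
    init_layer(rng, 2 * feat_dim, hidden),
    init_layer(rng, hidden, hidden),
    init_layer(rng, hidden, num_classes),
]
subject_feat = [rng.random() for _ in range(feat_dim)]
object_feat = [rng.random() for _ in range(feat_dim)]
probs = classify_relation(layers, subject_feat, object_feat)
```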

3.4 Model at Inference

During inference, the features extracted from each pair of objects are passed through the relation classifier to predict the relation between them.
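Inference over all pairs can be sketched as below. The sketch assumes that ordered pairs are scored (so that subject and object roles are distinguished) and that predictions are ranked by classifier confidence; `rank_relations` and the toy classifier are hypothetical stand-ins for the trained model.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Prediction = Tuple[int, str, int, float]  # (subject index, relation, object index, score)


def rank_relations(
    features: Sequence[Vector],
    classifier: Callable[[Vector, Vector], List[float]],
    class_names: Sequence[str],
    top_k: int,
) -> List[Prediction]:
    """Score every ordered object pair for every class and keep the top_k."""
    scored = []
    for i, j in permutations(range(len(features)), 2):
        probs = classifier(features[i], features[j])
        for name, p in zip(class_names, probs):
            scored.append((p, i, name, j))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, name, j, p) for p, i, name, j in scored[:top_k]]


def toy_classifier(a: Vector, b: Vector) -> List[float]:
    """Hypothetical stand-in: prefers "on" when the first feature is larger."""
    return [0.9, 0.1] if a[0] > b[0] else [0.2, 0.8]


top = rank_relations([[1.0], [0.0]], toy_classifier, ["on", "under"], top_k=2)
```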

Recall@K   Xu et al. (2017)   Newell and Deng (2017)   Yang et al. (2018)   Parsed caption   C-GEARD
50         44.8               68.0                     54.2                 4.1              15.3
100        53.0               75.2                     59.1                 4.1              25.2
Table 1: Comparison with respect to Recall@50 and Recall@100 on the PredCls metric, in %.

4 Experiments

4.1 Dataset

We use the MS COCO (Lin et al., 2014) dataset for training and the Visual Genome (Krishna et al., 2017) dataset for evaluation. MS COCO has images and their captions, and Visual Genome contains images and their associated scene graphs. The Visual Genome dataset consists in part of MS COCO images, and since we require ground truth captions and bounding boxes during training, we filter the Visual Genome dataset by considering only those images which are part of the original MS COCO dataset. Similar to Xu et al. (2017), we manually remove poor quality and overlapping bounding boxes with ambiguous object names, and filter to keep the 150 most frequent object categories and 50 most frequent predicates. Our final dataset thus comprises 41,731 images with 150 unique objects and 50 unique relations. We use a 70-30 train-validation split. We use the same test set as Xu et al. (2017), so that the results are comparable with other supervised baselines.

4.2 Baselines

Since, to the best of our knowledge, this work is the first to introduce the task of weakly supervised relationship prediction solely using captions and bounding boxes, we do not have any directly comparable baselines: all other work is either completely supervised or relies on all ground truth entity-relation triplets being present at train time. Consequently, we construct baselines relying solely on captions and ground truth bounding box locations that are comparable to our task. In particular, running the Stanford Scene Graph Parser (Schuster et al., 2015) on ground truth captions constructs a scene graph just from the image captions (which almost never capture all the information present in an image). We use this baseline as a lower bound, and to obtain insight into the limitations of scene graphs directly generated from captions. On the other hand, we use the supervised scene graph generation baselines of Yang et al. (2018) and Newell and Deng (2017) as an upper bound on our performance, since we rely on far less information and data.

4.3 Evaluation Metric

As our primary objective is to detect relations between entities, we use the PredCls evaluation metric (Xu et al., 2017), defined as the performance of recognizing the relation between two objects given their ground truth locations. We only use the entity bounding boxes’ locations without knowing the ground truth objects they contain. We show results on Recall@K (the fraction of ground truth relations that appear among the top K relations predicted by the model) for K = 50 and K = 100. The predicted relations are ranked over all object pairs for all relation classes by the relation classifier’s model confidence.
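For concreteness, Recall@K can be computed as in the sketch below, assuming the common convention in which the score is the fraction of ground truth relations found among the K most confident predictions. The triple format (subject index, relation, object index) and the function name are illustrative choices.

```python
from typing import List, Set, Tuple

Triple = Tuple[int, str, int]  # (subject index, relation, object index)


def recall_at_k(ranked_predictions: List[Triple], ground_truth: Set[Triple], k: int) -> float:
    """Fraction of ground truth triples found among the top-k ranked predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)


# Predictions are assumed to be sorted by decreasing classifier confidence.
ranked = [(0, "on", 1), (1, "under", 0), (0, "near", 1)]
truth = {(0, "on", 1), (2, "in", 0)}
score = recall_at_k(ranked, truth, k=2)
```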

Figure 2: Attention masks for each of the entities in the caption for C-GEARD. The output of the Stanford Scene Graph Parser is given on the right.

5 Results and Discussion

5.1 Performance

We show the performance of C-GEARD in Table 1. We compare its performance with various supervised baselines, as well as a baseline which parses relations from just the caption using the Stanford Scene Graph Parser (Schuster et al., 2015) (caption-only baseline), on the PredCls metric. Our proposed method substantially outperforms the caption-only baseline (15.3% vs. 4.1% Recall@50, and 25.2% vs. 4.1% Recall@100). This shows that our model predicts relationships more successfully than by purely relying on captions, which contain limited information. This in turn supports our hypothesis that it is possible to detect relations by pooling information from captions across images, without requiring all ground truth relationship annotations for every image.

Note that our model is at a significant disadvantage when compared to supervised approaches. First, we use pre-trained ResNet features (trained on a classification task) without any fine-tuning; supervised methods, however, use Faster RCNN (Ren et al., 2015), whose features are likely much better suited for multiple objects. Second, supervised methods likely have a better global view than C-GEARD, because Faster RCNN provides a significantly larger number of proposals, while we rely on ground truth regions which are far fewer in number. Third, and most significant, we have no ground truth relationship or class information, relying purely on weak supervision from captions to provide this information. Finally, since we require captions, we use significantly less data, training only on the subset of Visual Genome that overlaps with MS COCO (and therefore has ground truth captions).

5.2 Relation Classification

We train the relation classifier on image features of entity pairs, using the relations found in the captions as the only source of supervision. On the validation set, we obtain a relation classification accuracy of 22%.

We compute the top relations that the model gets most confused about, shown in Table 2. We observe that even when the predictions are not correct, they are semantically close to the ground truth relation class.

Relation     Confusion with relations
above        on, with, sitting on, standing on, of
carrying     holding, with, has, carrying, on
laying on    on, lying on, in, has
mounted on   on, with, along, at, attached to
Table 2: Relations the classification model gets most confused about
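One plausible way to produce a table of this kind is to tally, for each ground truth relation, the classes the classifier predicts most often. Whether correct predictions are excluded is not stated in the text (Table 2 lists carrying in its own row), so the sketch below exposes this as a flag; the function name and sample labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def top_confusions(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    n: int = 5,
    include_correct: bool = False,
) -> Dict[str, List[str]]:
    """For each true relation, list the n most frequently predicted classes."""
    tallies: Dict[str, Counter] = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        if include_correct or truth != predicted:
            tallies[truth][predicted] += 1
    return {
        truth: [label for label, _ in counts.most_common(n)]
        for truth, counts in tallies.items()
    }


true_labels = ["above", "above", "above", "carrying", "carrying"]
predicted = ["on", "on", "with", "holding", "carrying"]
errors_only = top_confusions(true_labels, predicted)
all_predictions = top_confusions(true_labels, predicted, include_correct=True)
```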

5.3 Visualizations

Three images with their captions are given in Figure 2. We can see that C-GEARD generates precise entity groundings, and that the Stanford Scene Graph Parser generates correct relations. This results in the correct grounding of the entities and relations which yields accurate training samples for the relation classifier.

6 Conclusion

In this work, we propose a novel task of weakly-supervised relation prediction, with the objective of detecting relations between entities in an image purely from captions and object-level bounding box annotations without class information. Our proposed method builds upon top-down attention (Anderson et al., 2018), which generates captions and grounds words in these captions to entities in images. We leverage this along with structure found from the captions by the Stanford Scene Graph Parser (Schuster et al., 2015) to allow for the classification of relations between pairs of objects without having ground truth information for the task. Our proposed approach thus allows weakly-supervised relation detection.

There are several interesting avenues for future work. One possible line of work involves removing the requirement of ground truth bounding boxes altogether by leveraging a recent line of work that does weakly-supervised object detection (such as Oquab et al., 2015; Bilen and Vedaldi, 2016; Zhang et al., 2018; Bai and Liu, 2017; Arun et al., 2018). This would reduce the amount of supervision required even further. An orthogonal line of future work might involve using a Visual Question Answering (VQA) task (such as in Krishna et al. (2017)), either on its own replacing the captioning task, or in conjunction with the captioning task with a multi-task learning objective.


We would like to thank Louis-Philippe Morency, Carla Viegas, Volkan Cirik and Barun Patra for helpful discussions and feedback. We would also like to thank the anonymous reviewers for their insightful comments and suggestions.


  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: §1.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Vol. 3, pp. 6. Cited by: §3.2, §6.
  • A. Arun, C. Jawahar, and M. P. Kumar (2018) Dissimilarity coefficient based weakly supervised object detection. arXiv preprint arXiv:1811.10016. Cited by: §6.
  • F. R. Bach and Z. Harchaoui (2008) Diffrac: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems, pp. 49–56. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
  • P. T. X. W. X. Bai and W. Liu (2017) Multiple instance detection network with online instance classifier refinement. Cited by: §6.
  • H. Bilen and A. Vedaldi (2016) Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854. Cited by: §6.
  • V. Cirik, T. Berg-Kirkpatrick, and L. Morency (2018) Using syntax to ground referring expressions in natural images. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • B. Dai, Y. Zhang, and D. Lin (2017) Detecting visual relationships with deep relational networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3298–3308. Cited by: §2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §1.
  • R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko (2017) Modeling relationships in referential expressions with compositional modular networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 4418–4427. Cited by: §2.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, §4.1, §6.
  • Y. Li, W. Ouyang, X. Wang, and X. Tang (2017a) Vip-cnn: visual phrase guided convolutional neural network. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 7244–7253. Cited by: §2.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017b) Scene graph generation from objects, phrases and region captions. In ICCV, Cited by: §2.
  • X. Liang, L. Lee, and E. P. Xing (2017) Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 4408–4417. Cited by: §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy (2017) Improved image captioning via policy gradient optimization of spider. In Proc. IEEE Int. Conf. Comp. Vis, Vol. 3, pp. 3. Cited by: §1.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869. Cited by: §2.
  • A. Newell and J. Deng (2017) Pixels to graphs by associative embedding. In Advances in neural information processing systems, pp. 2171–2180. Cited by: §2, Table 1, §4.2.
  • M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694. Cited by: §6.
  • J. Peyre, J. Sivic, I. Laptev, and C. Schmid (2017) Weakly-supervised learning of visual relations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5179–5188. Cited by: §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §5.1.
  • S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: §A.1.1, §3.2, §4.2, §5.1, §6.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. Cited by: §2, Table 1, §4.1, §4.3.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–685. Cited by: §2, Table 1, §4.2.
  • M. Y. Yang, W. Liao, H. Ackermann, and B. Rosenhahn (2017) On support relations and semantic scene graphs. ISPRS journal of photogrammetry and remote sensing 131, pp. 15–25. Cited by: §2.
  • M. Yatskar, L. Zettlemoyer, and A. Farhadi (2016) Situation recognition: visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5534–5542. Cited by: §2.
  • G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy (2018) Zoom-net: mining deep feature interactions for visual relationship recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 322–338. Cited by: §2.
  • H. Zhang, Z. Kyaw, S. Chang, and T. Chua (2017a) Visual translation embedding network for visual relation detection. In CVPR, Vol. 1, pp. 5. Cited by: §2.
  • H. Zhang, Z. Kyaw, J. Yu, and S. Chang (2017b) PPR-fcn: weakly supervised visual relation detection via parallel pairwise r-fcn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4233–4241. Cited by: §2.
  • X. Zhang, J. Feng, H. Xiong, and Q. Tian (2018) Zigzag learning for weakly supervised object detection. arXiv preprint arXiv:1804.09466. Cited by: §6.
  • Y. Zhu and S. Jiang (2018) Deep structured learning for visual relationship detection. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.

Appendix A

A.1 Failure Cases when Relying Solely on Captions

Figure 3: Failure case where the scene graph parser makes errors
Figure 4: Failure case where all captions are insufficiently descriptive
Figure 5: Failure case where the caption does not capture all relationships

In this section, we identify the key failure cases when relying solely on captions. These failures are primarily due to scene graph parser errors, insufficient information present in captions, and the inability of captions to capture all relationships present in the image.

A.1.1 Scene Graph Parser Errors

Generally, the scene graph parser is as effective as using human-constructed scene graphs (Schuster et al., 2015). However, there exist cases where the scene graph generated from the caption by the Stanford Scene Graph Parser is incorrect. For instance, in Figure 3, the parser yields two “yard” nodes. However, we observe that the majority of the errors are caused by the two subsequently described issues.

A.1.2 Insufficient Caption Information

We find that captions describe far less information than is actually present in the image. For example, in Figure 4, though there are multiple objects and relations in the image, none of the five captions is able to completely capture everything in the image.

A.1.3 Unable to Capture All Relationships

We find that captions don’t adequately capture all relationships. For example, there are multiple relations such as beneath-behind and up-upward that are not correctly captured. In other cases, some relations actually present in the image are missing; one such example is the inability to capture transitive relations. For example, in Figure 5, while the caption indicates that the light pole is next to the airplane, and that the airplane is behind a fence, the caption fails to capture the transitivity (i.e., that the light pole is behind the fence).

A.2 Correlation between Captions and Ground Truth

Our model formulation aims to pool information about subject-predicate-object triplets from the entire corpus of captions, and to use it to densely identify relations between entities in a single image. To validate whether the most common ground truth relation classes are actually present in the captions, we use the Stanford Scene Graph Parser to extract the predicates and compare their frequency counts with the ground truth relations. This correlation is demonstrated in Figure 6.
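A frequency comparison of this kind can be sketched as follows, assuming both sources have already been reduced to flat lists of canonical predicate strings. The function names and sample predicates are hypothetical, and normalizing counts to relative frequencies is a choice made here so that corpora of different sizes can be compared.

```python
from collections import Counter
from typing import Dict, List, Tuple


def relative_frequencies(predicates: List[str]) -> Dict[str, float]:
    """Normalize predicate counts so that they sum to one."""
    counts = Counter(predicates)
    total = sum(counts.values())
    return {predicate: count / total for predicate, count in counts.items()}


def compare_frequencies(
    caption_predicates: List[str], ground_truth_predicates: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {predicate: (caption frequency, ground truth frequency)}."""
    cap = relative_frequencies(caption_predicates)
    gt = relative_frequencies(ground_truth_predicates)
    return {p: (cap.get(p, 0.0), gt.get(p, 0.0)) for p in sorted(set(cap) | set(gt))}


comparison = compare_frequencies(
    ["on", "on", "in", "near"],  # predicates parsed from captions
    ["on", "in", "in", "has"],   # predicates from ground truth scene graphs
)
```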

Figure 6: Correlation between the ground truth triplets and the triplets present in captions
Figure 7: Examples of failures of entity and relation groundings generated by C-GEARD

A.3 Failure to Ground Cluttered Scenes

One failure case for C-GEARD is shown in Figure 7. The main reason for this is the large number of ground truth bounding boxes present in the image, which led to the model being unable to correctly capture the groundings.

A.4 Importance of ResNet Features

We tried two variants of extracting ResNet features given ground-truth bounding boxes. In the first, we used a fully convolutional approach, using the original object sizes. However, we observed extremely poor performance, and hypothesize that classification networks trained on ImageNet are tuned to ignore small objects. To resolve this, we resized objects so that their larger side is of size 224. We observed significantly better performance; consequently, all reported numbers use these features.
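The resizing rule used in the second variant can be made concrete with a small helper. The sketch assumes that the aspect ratio of each crop is preserved and that boxes are given as (x_min, y_min, x_max, y_max) in pixels; both function names are hypothetical.

```python
from typing import Tuple


def resized_shape(width: float, height: float, target: int = 224) -> Tuple[int, int]:
    """Scale (width, height) so the larger side equals target, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    scale = target / max(width, height)
    # Clamp to at least one pixel so very thin boxes do not collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


def crop_box_shape(box: Tuple[float, float, float, float], target: int = 224) -> Tuple[int, int]:
    """Output size of the crop for a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return resized_shape(x_max - x_min, y_max - y_min, target)


small_object = crop_box_shape((10, 20, 60, 120))  # a 50 x 100 pixel box
large_object = resized_shape(448, 224)
```

Small objects are thus enlarged and large objects shrunk to a common scale before being passed to the feature extractor.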

To validate that the benefits observed were due to the changed object feature representations, we trained a simple classifier using the 50 VG object classes with a linear layer (we tried other variants as well, but all other results obtained were comparable). We observed a substantial difference in the performance between these two variants: 45% accuracy vs 54% respectively.

A.5 Hyperparameters

We train the top-down attention model with entity attention dimension of 512, tanh non-linearity and batch size of 100. We used both the language model LSTM and the attention LSTM with 1000 hidden cells. The ResNet extracted object features were 2048 dimensional and the word embeddings were initialized to FastText embeddings of 300 dimensions. Finally, we train our model using an Adam optimizer and a learning rate of 0.0001 for 75 epochs. We train the relation classifier using a simple MLP with 2 hidden layers of 64 units each, with dropout of 0.5 using Adam optimizer and learning rate of 0.001 for 50 epochs.

A.6 Model Training and Inference

Figure 8 visually explains the training and inference of our proposed C-GEARD architecture.

Figure 8: C-GEARD’s training and inference.