An Interpretable Model for Scene Graph Generation

by Ji Zhang, et al.

We propose an efficient and interpretable scene graph generator. We consider three types of features: visual, spatial and semantic, and we use a late fusion strategy so that each feature's contribution can be explicitly investigated. We study the key factors about these features that have the most impact on performance, visualize the learned visual features for relationships, and investigate the efficacy of our model. We won first place in the OpenImages Visual Relationship Detection Challenge on Kaggle, where we outperform the 2nd place by 5% (20% relatively). We believe an accurate scene graph generator is a fundamental stepping stone for higher-level vision-language tasks such as image captioning and visual QA, since it provides a semantic, structured comprehension of an image that goes beyond pixels and objects.




1 Introduction

Scene graph generation is a fundamental task that bridges low-level vision, such as scene parsing and object detection, and high-level vision-language problems such as image captioning Lu2018Neural ; yao2018exploring and visual QA antol2015vqa ; johnson2017clevr . It produces a structured semantic understanding of an image from individual objects, and provides rich information for those high-level tasks. Current state-of-the-art methods lu2016visual ; yu17iccv ; Zhuang_2017_ICCV ; plummerPLCLC2017 ; dai2017detecting ; zhang2017visual ; Yang_2018_ECCV ; LiCVPR2017 ; xu2017scenegraph ; zellers2018neural ; Yin_2018_ECCV ; zhang2017relationship ; zhang2018large use three types of features to represent relationships: 1) visual features: the CNN features of the two objects or their combination; 2) spatial features: coordinates of the two objects, which encode their spatial layout; 3) semantic features: class labels of the two objects, which provide a strong prior on the predicate. Most of them, if not all, combine the three features at an early stage to learn a compositional feature for relationship prediction. The contribution of each feature is thus implicit and probably not optimized. In this paper we propose a structure that instead explicitly builds three branches for the three features, each contributing to the output in an interpretable way, and we fuse their outputs in the final stage to get optimized predictions.

Our contributions are: 1) we propose a new model that efficiently combines three features and shows explicitly what each feature contributes to the final prediction and how large that contribution is; 2) we demonstrate the efficacy of our model on three datasets: OpenImages (OI) openimages , Visual Genome (VG) krishnavisualgenome and Visual Relationship Detection (VRD) lu2016visual . We won 1st place in the OpenImages Challenge, and we outperform state-of-the-art methods on VG and VRD by significant margins.

2 Model Description

The task of visual relationship detection can be defined as a mapping $f$ from an image $I$ to 3 labels and 2 boxes:

$$I \xrightarrow{f} l_S, l_P, l_O, b_S, b_O$$

where $l$ and $b$ stand for labels and boxes, and $S$, $P$, $O$ stand for subject, predicate and object. We decompose $f$ into an object detector $f_{det}$ and a relationship classifier $f_{rel}$:

$$I \xrightarrow{f_{det}} l_S, l_O, b_S, b_O, v_S, v_O \xrightarrow{f_{rel}} l_P$$

where $v_S$ and $v_O$ are the visual features of the subject and object.

The decomposition means that we can run an object detector on the input image to obtain labels, boxes and visual features for subject and object, then use these as input features to the relationship classifier, which only needs to output a label. There are two obvious advantages to this model: 1) learning complexity is dramatically reduced, since we can simply use an off-the-shelf object detector as $f_{det}$ without the need for re-training, hence the learnable weights exist only in the small subnet $f_{rel}$; 2) we have much richer features for relationships, i.e., $l_S, l_O, b_S, b_O, v_S, v_O$ for $f_{rel}$, instead of only the image $I$ for $f$.

We further assume that the semantic feature $(l_S, l_O)$, spatial feature $(b_S, b_O)$ and visual feature $(v_S, v_O)$ are independent of each other, so we can build 3 separate branches of sub-networks for them. This is the basic workflow of our model.

Figure 1 shows our model in detail. The network takes an input image and outputs the 6 aforementioned features; each branch then uses its corresponding feature to produce a confidence score for predicates, and all scores are added up and normalized by softmax. We now introduce each module’s design and motivation.
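The late fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the example scores are hypothetical, and it assumes each branch emits one unnormalized score per predicate class.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_branches(*branch_scores):
    """Late fusion: add the per-branch predicate scores, then normalize."""
    summed = [sum(vals) for vals in zip(*branch_scores)]
    return summed, softmax(summed)

# Hypothetical scores for one object pair over 4 predicate classes.
semantic = [2.0, 0.5, 0.1, -1.0]
spatial = [0.3, 1.2, 0.0, 0.0]
visual = [1.0, -0.2, 0.4, 0.1]

summed, probs = fuse_branches(semantic, spatial, visual)
best = probs.index(max(probs))
```

Because the branches are only added before the softmax, the share of each branch in a given predicate's score can be read off directly from the three input lists.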

Figure 1: Model Architecture

2.1 Relationship Proposal

A relationship proposal is defined as a pair of objects that is very likely related zhang2017relationship . In our model we first detect all meaningful objects by running an object detector, then we simply treat each pair of objects as a relationship proposal. The following modules learn to classify each pair as either “no relationship” or one of the predicates, not including the “is” relationship.
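Enumerating the proposals can be sketched in a few lines. The sketch assumes that pairs are ordered (subject first, object second), since swapping the roles generally changes the relationship; the text itself only says "each pair of objects". The detection list is a hypothetical placeholder.

```python
from itertools import permutations

def relationship_proposals(detections):
    """Return every ordered (subject_index, object_index) pair of distinct detections."""
    return list(permutations(range(len(detections)), 2))

# Hypothetical detector output: one class label per detected object.
detections = ["person", "guitar", "microphone"]
pairs = relationship_proposals(detections)
```

With n detections this yields n(n-1) proposals, so the cost of the later modules grows quadratically with the number of detected objects.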

2.2 Semantic Module

Zellers et al. zellers2018neural introduced a frequency baseline that performs reasonably well on the Visual Genome dataset xu2017scene by counting frequencies of predicates given subject and object. Its motivation is that, in general, the types of relationships between two objects are usually limited, e.g., given the subject being person and the object being horse, their relationship is highly likely to be “ride”, “walk” or “feed”, but less likely to be “stand on”, “carry”, “wear”, etc. In short, the composition of subject, predicate and object is usually biased. Furthermore, any specific relationship detection dataset can only contain a limited number of such compositions, making the bias even stronger. This is a factor that we find very useful to leverage.

We improved this baseline by removing the background class of subject and object. Specifically, for each training image we count the occurrences of each predicate given the subject and object classes in the ground-truth annotations, and we end up with an empirical distribution of predicates conditioned on the subject-object pair for the whole training set. We do this under the assumption that the test set is drawn from the same distribution. We then build the remaining modules to learn a complementary residual on top of the output of this baseline.
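The counting procedure can be illustrated with a small sketch. It is a simplified reading of the description above: the annotation format, the function name and the toy triplets are assumptions made for illustration, and handling of the “no relationship” class is left out.

```python
from collections import Counter, defaultdict

def build_frequency_prior(annotations):
    """Estimate the predicate distribution for every (subject, object) class pair.

    annotations: iterable of (subject_label, predicate_label, object_label)
    taken from ground-truth training annotations.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in annotations:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, counter in counts.items():
        total = sum(counter.values())
        prior[pair] = {pred: n / total for pred, n in counter.items()}
    return prior

# Hypothetical ground-truth triplets.
annotations = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
    ("person", "hold", "microphone"),
]
prior = build_frequency_prior(annotations)
```

Here `prior[("person", "horse")]` assigns 2/3 to “ride” and 1/3 to “feed”; predicates never seen for a pair get no entry, which is the bias the remaining modules have to correct.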

2.3 Spatial Module

In the challenge dataset, the three predicates “on”, “under” and “inside_of” indicate purely spatial relationships, i.e., the relative locations of subject and object are sufficient to tell the relationship. A common solution, as applied in Faster-RCNN ren2015faster , is to learn a mapping from visual features to location offsets. However, the learning becomes significantly harder when the two objects are far apart gkioxari2017interactnet , which is often the case for relationships. We capture spatial information by encoding the box coordinates of subjects and objects using box deltas ren2015faster , written $\Delta(b_1, b_2)$ for two boxes $b_1, b_2$, and normalized coordinates, written $c(b)$ for a box $b$, which are defined as:

$$\Delta(b_1, b_2) = \left\langle \frac{x_1 - x_2}{w_2}, \frac{y_1 - y_2}{h_2}, \log\frac{w_1}{w_2}, \log\frac{h_1}{h_2} \right\rangle$$

$$c(b) = \left\langle \frac{x_{min}}{w}, \frac{y_{min}}{h}, \frac{x_{max}}{w}, \frac{y_{max}}{h}, \frac{a_{box}}{a_{img}} \right\rangle$$

where $b_1 = (x_1, y_1, w_1, h_1)$ and $b_2 = (x_2, y_2, w_2, h_2)$, $w$ and $h$ are the width and height of the image, and $a_{box}$ and $a_{img}$ are the areas of the box and the image.
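A possible encoding of these spatial features is sketched below. Several details are assumptions rather than facts from the text: boxes are given as (xmin, ymin, xmax, ymax), the box delta follows the Faster R-CNN regression-target form computed on box centers, and the final feature simply concatenates one subject-to-object delta with the normalized coordinates of both boxes. All function names are hypothetical.

```python
import math

def box_delta(b1, b2):
    """Faster R-CNN style delta of box b1 relative to box b2 (center offsets, log size ratios)."""
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    cx1, cy1 = b1[0] + 0.5 * w1, b1[1] + 0.5 * h1
    cx2, cy2 = b2[0] + 0.5 * w2, b2[1] + 0.5 * h2
    return [(cx1 - cx2) / w2, (cy1 - cy2) / h2, math.log(w1 / w2), math.log(h1 / h2)]

def normalized_coords(box, img_w, img_h):
    """Corner coordinates divided by the image size, plus the box-to-image area ratio."""
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    return [box[0] / img_w, box[1] / img_h, box[2] / img_w, box[3] / img_h,
            area_box / (img_w * img_h)]

def spatial_feature(subj_box, obj_box, img_w, img_h):
    """Assumed layout: one delta followed by both normalized-coordinate vectors."""
    return (box_delta(subj_box, obj_box)
            + normalized_coords(subj_box, img_w, img_h)
            + normalized_coords(obj_box, img_w, img_h))

# Hypothetical boxes in a 200 x 400 image.
subj_box = (10.0, 20.0, 110.0, 220.0)
obj_box = (60.0, 120.0, 160.0, 220.0)
delta = box_delta(subj_box, obj_box)
feature = spatial_feature(subj_box, obj_box, 200.0, 400.0)
```

Both parts are scale-invariant: the delta is expressed relative to the second box and the coordinates relative to the image, so the same layout yields the same feature at any resolution.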

2.4 Visual Module

The visual module is useful mainly for three reasons: 1) it accounts for all other types of relationships that spatial features can hardly predict, e.g., interactions such as “man play guitar” and “woman wear handbag”; 2) it solves relationship reference problems krishna2018referring , i.e., when there are multiple subjects or objects that belong to the same category, we need to know which subject is related to which object; 3) for some specific interactions, e.g., “throw”, “eat”, “ride”, the visual appearance of the subject or object alone is very informative about the predicate. With these motivations, we feed the subject, predicate and object ROIs into the backbone and take the feature vectors from its last fc layer as our visual features. We then concatenate these three features and feed them into 2 additional randomly initialized fc layers followed by an extra fc layer to get a logit, i.e., an unnormalized score. We also add one fc layer on top of the subject feature and another fc layer on top of the object feature to get two scores. These two scores are the predictions made solely by the subject/object feature, according to the third reason mentioned above.
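The layer arrangement of the visual module can be sketched in plain Python. The feature size, hidden size, number of predicates, ReLU activations and the choice to output one score per predicate from each head are all assumptions made for illustration; the ROI features would come from the backbone and are replaced here by random vectors.

```python
import random

def init_fc(n_in, n_out, rng):
    """Randomly initialize one fully connected layer as (weights, bias)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fc(x, layer):
    """Apply a fully connected layer to a feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def visual_module(f_subj, f_pred, f_obj, params):
    """Return the joint score and the subject-only / object-only scores."""
    joint = f_subj + f_pred + f_obj              # concatenate the three ROI features
    hidden = relu(fc(joint, params["fc1"]))      # first additional fc layer
    hidden = relu(fc(hidden, params["fc2"]))     # second additional fc layer
    return {
        "joint": fc(hidden, params["fc_out"]),   # extra fc layer producing logits
        "subject": fc(f_subj, params["fc_subj"]),
        "object": fc(f_obj, params["fc_obj"]),
    }

FEAT_DIM, HIDDEN_DIM, NUM_PREDICATES = 8, 16, 5  # hypothetical sizes
rng = random.Random(0)
params = {
    "fc1": init_fc(3 * FEAT_DIM, HIDDEN_DIM, rng),
    "fc2": init_fc(HIDDEN_DIM, HIDDEN_DIM, rng),
    "fc_out": init_fc(HIDDEN_DIM, NUM_PREDICATES, rng),
    "fc_subj": init_fc(FEAT_DIM, NUM_PREDICATES, rng),
    "fc_obj": init_fc(FEAT_DIM, NUM_PREDICATES, rng),
}
f_subj, f_pred, f_obj = ([rng.random() for _ in range(FEAT_DIM)] for _ in range(3))
scores = visual_module(f_subj, f_pred, f_obj, params)
```

Keeping the subject-only and object-only heads separate from the joint head means their scores can be inspected individually before they are fused with the other branches.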

2.5 The “is” Relationship

In the OpenImages challenge, “is” is also considered a relationship, in which only one object is involved. We handle this sub-task with a completely separate, single-branch, Fast-RCNN based model. We use the same object detector to get proposals for this model, then for each proposal the model produces a probability distribution over all attributes with the Fast-RCNN pipeline.

3 Experiments

We present quantitative and qualitative results on OpenImages (OI), together with an ablation study on each component of our model. We also show results on the Visual Genome (VG) and Visual Relationship Detection (VRD) datasets compared with strong previous methods.

Team ID          Public       Team ID          Private
VRD_NN (8th)     0.20643      ZFTurbo (8th)    0.17621
anokas (7th)     0.21573      anokas (7th)     0.17960
MIL (6th)        0.21774      MIL (6th)        0.19666
tito (5th)       0.25571      radek (5th)      0.20113
toshif (4th)     0.25621      toshif (4th)     0.22832
Kyle (3rd)       0.28043      Kyle (3rd)       0.23491
radek (2nd)      0.28886      tito (2nd)       0.23709
Seiji (Ours)     0.33213      Seiji (Ours)     0.28544
Table 1: Kaggle leaderboards.
            R@50    mAP_rel   mAP_phr   score
Baseline    72.98   26.54     32.77     38.32
            74.13   32.41     39.55     43.61
            74.46   34.16     39.59     44.39
            74.40   34.96     40.70     45.14
Table 2: Ablation Study on OI.
                                        Graph Constraint                                      No Graph Constraint
                                        SG Det.           SG Cls.           Pred. Cls.        SG Det.     SG Cls.     Pred. Cls.
Method                                  R@20  R@50  R@100 R@20  R@50  R@100 R@20  R@50  R@100 R@50  R@100 R@50  R@100 R@50  R@100
VRD lu2016visual                        -     0.3   0.5   -     11.8  14.1  -     27.9  35.0  -     -     -     -     -     -
Associative Embedding newell2017pixels  6.5   8.1   8.2   18.2  21.8  22.6  47.9  54.1  55.4  9.7   11.3  26.5  30.0  68.0  75.2
Message Passing xu2017scenegraph        -     3.4   4.2   -     21.7  24.4  -     44.8  53.0  -     -     -     -     -     -
Message Passing+                        14.6  20.7  24.5  31.7  34.6  35.4  52.7  59.3  61.3  22.0  27.4  43.4  47.2  75.2  83.6
Frequency                               17.7  23.5  27.6  27.7  32.4  34.0  49.4  59.9  64.1  25.3  30.9  40.5  43.7  71.3  81.2
Frequency+Overlap                       20.1  26.2  30.1  29.3  32.3  32.9  53.6  60.6  62.2  28.6  34.4  39.0  43.4  75.7  82.9
MotifNet-NOCONTEXT                      21.0  26.2  29.0  31.9  34.8  35.5  57.0  63.7  65.6  29.8  34.7  43.4  46.6  78.8  85.9
MotifNet-LeftRight                      21.4  27.2  30.3  32.9  35.8  36.5  58.5  65.2  67.1  30.5  35.8  44.5  47.7  81.1  88.3
Ours                                    20.8  28.1  32.5  36.1  36.7  36.7  66.7  68.3  68.3  30.1  36.4  48.9  50.8  93.7  97.7
Table 3: Comparison with state-of-the-art methods on VG. SG Det., SG Cls. and Pred. Cls. denote Scene Graph Detection, Scene Graph Classification and Predicate Classification.

                            Free k                      Relationship Detection                    Phrase Detection
                            Relationship  Phrase        k = 1         k = 10        k = 70        k = 1         k = 10        k = 70
Method                      R@50   R@100  R@50   R@100  R@50   R@100  R@50   R@100  R@50   R@100  R@50   R@100  R@50   R@100  R@50   R@100
DR-Net* dai2017detecting    17.73  20.88  19.93  23.45  -      -      -      -      -      -      -      -      -      -      -      -
ViP-CNN LiCVPR2017          17.32  20.01  22.78  27.91  17.32  20.01  -      -      -      -      22.78  27.91  -      -      -      -
VRL liang2017deep           18.19  20.79  21.37  22.60  18.19  20.79  -      -      -      -      21.37  22.60  -      -      -      -
PPRFCN* zhang2017ppr        14.41  15.72  19.62  23.75  -      -      -      -      -      -      -      -      -      -      -      -
VTransE*                    14.07  15.20  19.42  22.42  -      -      -      -      -      -      -      -      -      -      -      -
SA-Full* Peyre17            15.80  17.10  17.90  19.50  -      -      -      -      -      -      -      -      -      -      -      -
CAI* Zhuang_2017_ICCV       20.14  23.39  23.88  25.26  -      -      -      -      -      -      -      -      -      -      -      -
KL distillation yu17iccv    22.68  31.89  26.47  29.76  19.17  21.34  22.56  29.89  22.68  31.89  23.14  24.03  26.47  29.76  26.32  29.43
Zoom-Net Yin_2018_ECCV      21.37  27.30  29.05  37.34  18.92  21.41  -      -      21.37  27.30  24.82  28.09  -      -      29.05  37.34
CAI + SCA-M Yin_2018_ECCV   22.34  28.52  29.64  38.39  19.54  22.39  -      -      22.34  28.52  25.21  28.89  -      -      29.64  38.39
Ours (ImageNet)             21.62  26.12  28.59  35.18  19.57  22.61  21.62  26.12  21.62  26.12  26.39  31.28  28.59  35.18  28.59  35.18
Ours (COCO)                 26.67  32.55  33.29  41.25  24.30  27.91  26.67  32.55  26.67  32.55  31.09  36.42  33.29  41.25  33.29  41.25

Table 4: Results on the VRD lu2016visual dataset (“-” means unavailable / unknown)

OI: In Table 1 we show the competition results on both the public and private leaderboards. The score is computed as the weighted average of three metrics: recall of the top 50 predictions (R@50), mean average precision of relationships (mAP_rel), and mean average precision of phrases (mAP_phr). The weights for them are 0.2, 0.4, 0.4, respectively. Our model surpasses the 2nd place by 15% relatively on the public dataset and 20% relatively on the private dataset.
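As a quick check of the arithmetic, the weighted score and the relative margins can be recomputed from the numbers reported in Tables 1 and 2. The function names are hypothetical; the weights and input values are taken from the text and tables.

```python
def challenge_score(recall_at_50, map_rel, map_phr):
    """Weighted average with weights 0.2, 0.4 and 0.4."""
    return 0.2 * recall_at_50 + 0.4 * map_rel + 0.4 * map_phr

def relative_gain(ours, other):
    """Improvement of `ours` over `other`, as a fraction of `other`."""
    return (ours - other) / other

# Baseline row of Table 2: R@50, mAP_rel, mAP_phr.
baseline_score = challenge_score(72.98, 26.54, 32.77)

# First and second place on the public and private leaderboards (Table 1).
public_gain = relative_gain(0.33213, 0.28886)
private_gain = relative_gain(0.28544, 0.23709)
```

The baseline evaluates to 38.32, matching the score column of Table 2, and the two gains come out at roughly 0.15 and 0.20, matching the relative margins quoted above.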

VG: We present an experimental comparison with state-of-the-art methods on the Visual Genome dataset in Table 3. We use the same train/test splits as in xu2017scenegraph . We use the same evaluation metrics as zellers2018neural , which uses three modes: 1) Predicate Classification: predict predicate labels given ground truth subject and object boxes and labels; 2) Scene Graph Classification: predict subject, object and predicate labels given ground truth subject and object boxes; 3) Scene Graph Detection: predict all the three labels and two boxes. Recalls under the top 20, 50, 100 predictions are used as the measurements.
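Recall under the top K predictions can be sketched for the simplest mode, Predicate Classification, where boxes and object labels are given and a triplet can therefore be identified by box indices. This is a simplified illustration with hypothetical names and data; the detection modes additionally require matching predicted boxes to ground-truth boxes, which is omitted here.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the k highest-scoring predictions.

    predictions: list of (score, (subject_index, predicate, object_index))
    ground_truth: set of (subject_index, predicate, object_index)
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in ranked if triplet in ground_truth}
    return len(hits) / len(ground_truth)

# Hypothetical image with three boxes and two annotated relationships.
ground_truth = {(0, "ride", 1), (0, "wear", 2)}
predictions = [
    (0.9, (0, "ride", 1)),
    (0.6, (0, "hold", 2)),
    (0.5, (0, "wear", 2)),
    (0.1, (1, "on", 2)),
]
r1 = recall_at_k(predictions, ground_truth, 1)
r3 = recall_at_k(predictions, ground_truth, 3)
```

In this toy case one of the two annotated triplets is recovered at K = 1 and both at K = 3.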

VRD: We compare with state-of-the-art methods on the VRD dataset in Table 4. We use the metrics presented in yu17iccv . Note that there is a variable k in this metric, which is the number of relation candidates when selecting the top 50/100. Since not all previous methods specified k in their evaluation, we first report performance in the “free k” column, treating k as a hyper-parameter that can be cross-validated. For methods where k is reported for 1 or more values, the column reports the performance using the best k. We then list all available results with specific k in the right two columns.
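The role of k can be illustrated with a short sketch. It assumes that k is the number of predicates kept per object pair before all candidates are pooled and the top 50/100 are selected, and that candidates are ranked by predicate score alone; both points, as well as the names and scores, are assumptions made for illustration.

```python
def top_relations(pair_scores, k, top_n):
    """Keep the k best predicates per object pair, then the top_n candidates overall.

    pair_scores: {(subject_index, object_index): {predicate: score}}
    """
    candidates = []
    for (subj, obj), scores in pair_scores.items():
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        candidates.extend((score, (subj, pred, obj)) for pred, score in best)
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [triplet for _, triplet in candidates[:top_n]]

# Hypothetical predicate scores for two object pairs.
pair_scores = {
    (0, 1): {"ride": 0.7, "feed": 0.6, "on": 0.1},
    (0, 2): {"wear": 0.5, "hold": 0.2},
}
with_k1 = top_relations(pair_scores, k=1, top_n=3)
with_k2 = top_relations(pair_scores, k=2, top_n=3)
```

With k = 1 each pair contributes a single triplet, so only two candidates exist; with k = 2 the second-best predicate of a confident pair can displace the best predicate of a weaker pair, which is why results under different k are not directly comparable.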

(a) image with gt relationship
(b) feature from conv_body_det
(c) feature from conv_body_rel
Figure 2: Visualization of learned CNN features. (a) shows the image with its ground-truth relationship, (b) shows the convolution feature from the object detector backbone, and (c) shows the feature from the predicate backbone that we train along with the whole model.

Visualization Results: In Figure 2 we show convolution feature maps from the two backbones described in Figure 1 given an image with the ground-truth relationship “man hold microphone”. It is very clear that the object detector focuses mostly on the contour of the person, while the predicate branch accurately learns to capture the most informative region that represents “holds”, i.e., the intersection of the microphone and the fingers that are holding it. This is the most critical reason why our model performs well.

Ablation Study: In Table 2 we show evaluation results on the validation set for four models with the following settings: 1) baseline: only the semantic module; 2) the semantic module and the visual module without the direct predictions from subject/object; 3) the semantic module and the complete visual module; 4) our complete model.

Figure 3: Qualitative results

Qualitative Results on OI: We show several example outputs of our model. We can see from Figure 3 that we are able to correctly resolve relationship references, i.e., when there are multiple people playing multiple guitars, our model accurately points to the truly related pairs. Our model is also able to handle potentially confusing cases, e.g., in the rightmost image, the person is holding the microphone but not playing the drum, though from a coarse viewpoint it looks like his hand is touching the drum.


References

  • (1) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • (2) B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
  • (3) G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. CVPR, 2018.
  • (4) J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
  • (5) I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from, 2017.
  • (6) R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • (7) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
  • (8) Y. Li, W. Ouyang, and X. Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. CVPR, 2017.
  • (9) X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.
  • (10) C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
  • (11) J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.
  • (12) A. Newell and J. Deng. Pixels to graphs by associative embedding. In Advances in neural information processing systems, pages 2171–2180, 2017.
  • (13) J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
  • (14) B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
  • (15) S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • (16) D. Xu, Y. Zhu, C. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • (17) D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
  • (18) X. Yang, H. Zhang, and J. Cai. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In The European Conference on Computer Vision (ECCV), September 2018.
  • (19) T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
  • (20) G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In The European Conference on Computer Vision (ECCV), September 2018.
  • (21) R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • (22) R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context.
  • (23) H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. arXiv preprint arXiv:1702.08319, 2017.
  • (24) H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2017.
  • (25) J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal. Relationship proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5678–5686, 2017.
  • (26) J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny. Large-scale visual relationship understanding. In AAAI, 2019.
  • (27) B. Zhuang, L. Liu, C. Shen, and I. Reid. Towards context-aware interaction recognition for visual relationship detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.