Scene graph generation is a fundamental task that bridges low-level vision, such as scene parsing and object detection, and high-level vision-language problems such as image captioning Lu2018Neural ; yao2018exploring and visual QA antol2015vqa ; johnson2017clevr . It produces a structured semantic understanding of an image from individual objects and provides rich information for those high-level tasks. Current state-of-the-art methods lu2016visual ; yu17iccv ; Zhuang_2017_ICCV ; plummerPLCLC2017 ; dai2017detecting ; zhang2017visual ; Yang_2018_ECCV ; LiCVPR2017 ; xu2017scenegraph ; zellers2018neural ; Yin_2018_ECCV ; zhang2017relationship ; zhang2018large use three types of features to represent relationships: 1) visual features: the CNN features of the two objects or their combination; 2) spatial features: coordinates of the two objects, which encode their spatial layout; 3) semantic features: class labels of the two objects, which provide a strong prior on the predicate. Most of them, if not all, combine the three features at an early stage to learn a compositional feature for relationship prediction. The contribution of each feature is thus implicit and probably not optimal. In this paper we propose a structure that instead explicitly builds three branches for the three features, each contributing to the output in an interpretable way, and we fuse their outputs in the final stage to obtain the final predictions.
Our contributions are: 1) we propose a new model that efficiently combines the three features and shows explicitly what each feature contributes to the final prediction and by how much; 2) we demonstrate the efficacy of our model on three datasets: OpenImages (OI) openimages , Visual Genome (VG) krishnavisualgenome and Visual Relationship Detection (VRD) lu2016visual . We won 1st place in the OpenImages Challenge, and we outperform state-of-the-art methods on VG and VRD by significant margins.
2 Model Description
The task of visual relationship detection can be defined as a mapping $f$ from an image $I$ to 3 labels and 2 boxes:
$$I \xrightarrow{f} l_S, l_P, l_O, b_S, b_O$$
where $l, b$ stand for labels and boxes, and $S, P, O$ stand for subject, predicate, object. We decompose $f$ into an object detector $f_{det}$ and a relationship classifier $f_{rel}$:
$$I \xrightarrow{f_{det}} l_S, l_O, b_S, b_O, v_S, v_O \xrightarrow{f_{rel}} l_P$$
The decomposition means that we can run an object detector on the input image to obtain labels, boxes and visual features $v_S, v_O$ for subject and object, then use these as input features to the relationship classifier, which only needs to output a label. There are two obvious advantages to this model: 1) learning complexity is dramatically reduced, since we can simply use an off-the-shelf object detector as $f_{det}$ without re-training, hence the learnable weights exist only in the small subnet $f_{rel}$; 2) we have much richer features for relationships, i.e., $l_S, l_O, b_S, b_O, v_S, v_O$ for $f_{rel}$, instead of only the image $I$ for $f$.
We further assume that the semantic feature $(l_S, l_O)$, the spatial feature $(b_S, b_O)$ and the visual feature $(v_S, v_O)$ are independent of each other, so we can build 3 separate sub-network branches for them. This is the basic workflow of our model.
Figure 1 shows our model in detail. The network takes an input image and outputs the 6 aforementioned features; each branch then uses its corresponding feature to produce a confidence score for each predicate, and all scores are added up and normalized by softmax. We now introduce each module’s design and motivation.
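The late-fusion step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`softmax`, `fuse_branches`) and the example scores are hypothetical, and each branch is assumed to emit one unnormalized score per predicate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(semantic, spatial, visual):
    """Add the per-predicate scores of the three branches, then normalize."""
    if not (len(semantic) == len(spatial) == len(visual)):
        raise ValueError("all branches must score the same set of predicates")
    summed = [a + b + c for a, b, c in zip(semantic, spatial, visual)]
    return softmax(summed)


# Hypothetical scores for three predicates (illustrative values only).
semantic_scores = [2.0, 0.5, -1.0]
spatial_scores = [0.1, 1.2, 0.0]
visual_scores = [1.5, -0.3, 0.2]
probs = fuse_branches(semantic_scores, spatial_scores, visual_scores)
```

Because the branches are combined by simple addition before the softmax, the score each branch assigns to a predicate can be read off directly as its contribution to the final prediction.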
2.1 Relationship Proposal
A relationship proposal is defined as a pair of objects that are very likely to be related zhang2017relationship . In our model we first detect all meaningful objects by running an object detector, then we simply treat each pair of objects as a relationship proposal. The following modules learn to classify each pair as either “no relationship” or one of the predicates, not including the “is” relationship.
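As a small sketch of this exhaustive pairing, the snippet below enumerates proposals from a list of detections. It assumes ordered pairs (since subject and object play different roles) and excludes pairing a detection with itself; the detection dictionaries and the helper name `relationship_proposals` are hypothetical.

```python
from itertools import permutations


def relationship_proposals(detections):
    """Return every ordered (subject, object) pair of distinct detections."""
    return list(permutations(detections, 2))


# Hypothetical detector output: label plus (x_min, y_min, x_max, y_max) box.
detections = [
    {"label": "person", "box": (10, 20, 110, 220)},
    {"label": "guitar", "box": (60, 100, 160, 180)},
    {"label": "chair", "box": (200, 150, 280, 260)},
]
proposals = relationship_proposals(detections)
```

With n detections this yields n(n-1) proposals, so the downstream classifier must be cheap enough to score all of them.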
2.2 Semantic Module
Zellers et al. zellers2018neural introduced a frequency baseline that performs reasonably well on the Visual Genome dataset xu2017scene by counting frequencies of predicates given subject and object. Its motivation is that, in general, the types of relationships between two objects are limited, e.g., given the subject being person and the object being horse, their relationship is highly likely to be “ride”, “walk”, “feed”, but less likely to be “stand on”, “carry”, “wear”, etc. In short, the composition of subject, predicate and object is usually biased. Furthermore, any specific relationship detection dataset can only contain a limited number of such compositions, making the bias even stronger. This is a factor that we find very useful to leverage.
We improved this baseline by removing the background class of subject and object. Specifically, for each training image we count the occurrences of the predicate label $l_P$ given the subject and object labels $(l_S, l_O)$ in the ground truth annotations, and we end up with an empirical distribution $p(l_P \mid l_S, l_O)$ for the whole training set. We do this under the assumption that the test set is drawn from the same distribution. We then build the remaining modules to learn a complementary residual on top of the output of this baseline.
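The counting procedure can be sketched as below. This is an illustrative reading of the frequency baseline, not the authors' code: the function name `build_frequency_prior` and the toy triples are hypothetical, and only ground-truth triples are counted (no background class).

```python
from collections import Counter, defaultdict


def build_frequency_prior(annotations):
    """Estimate p(predicate | subject, object) from ground-truth triples.

    annotations: iterable of (subject_label, predicate_label, object_label).
    Returns a dict mapping (subject, object) to {predicate: probability}.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in annotations:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, counter in counts.items():
        total = sum(counter.values())
        prior[pair] = {pred: n / total for pred, n in counter.items()}
    return prior


# Toy training annotations (illustrative only).
triples = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
    ("person", "wear", "hat"),
]
prior = build_frequency_prior(triples)
```

Here `prior[("person", "horse")]` assigns two thirds of the mass to “ride” and one third to “feed”, which is the kind of label-only bias the semantic module exploits.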
2.3 Spatial Module
In the challenge dataset, the three predicates “on”, “under” and “inside_of” indicate purely spatial relationships, i.e., the relative locations of subject and object are sufficient to tell the relationship. A common solution, as applied in Faster-RCNN ren2015faster , is to learn a mapping from visual features to location offsets. However, the learning becomes significantly harder when the two objects are far apart gkioxari2017interactnet , which is often the case for relationships. We capture spatial information by encoding the box coordinates of subjects and objects using box deltas ren2015faster and normalized coordinates:
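$$\langle \Delta(b_S, b_O),\ c(b_S),\ c(b_O) \rangle$$
where $\Delta(b_1, b_2)$ is the box delta of two boxes $b_1, b_2$, and $c(b)$ is the normalized coordinates of box $b$, which are defined as:
$$\Delta(b_1, b_2) = \left\langle \frac{x_1 - x_2}{w_2},\ \frac{y_1 - y_2}{h_2},\ \log\frac{w_1}{w_2},\ \log\frac{h_1}{h_2} \right\rangle$$
$$c(b) = \left\langle \frac{x_{min}}{w_{img}},\ \frac{y_{min}}{h_{img}},\ \frac{x_{max}}{w_{img}},\ \frac{y_{max}}{h_{img}},\ \frac{a_{box}}{a_{img}} \right\rangle$$
where $b_1 = (x_1, y_1, w_1, h_1)$ and $b_2 = (x_2, y_2, w_2, h_2)$, $w_{img}$ and $h_{img}$ are the width and height of the image, and $a_{box}$ and $a_{img}$ are the areas of the box and the image.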
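A minimal sketch of this spatial encoding follows. It assumes the standard Faster R-CNN box-delta parameterization (center offsets scaled by the second box's size, plus log size ratios) and corner coordinates normalized by image size together with the relative box area; the helper names and example boxes are hypothetical.

```python
import math


def to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (center_x, center_y, w, h)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


def box_delta(b1, b2):
    """Box delta of b1 relative to b2, both given as (x, y, w, h)."""
    x1, y1, w1, h1 = b1
    x2, y2, w2, h2 = b2
    return [(x1 - x2) / w2, (y1 - y2) / h2,
            math.log(w1 / w2), math.log(h1 / h2)]


def normalized_coords(box, img_w, img_h):
    """Corner coordinates scaled by image size, plus relative box area."""
    x_min, y_min, x_max, y_max = box
    area = (x_max - x_min) * (y_max - y_min)
    return [x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h,
            area / (img_w * img_h)]


def spatial_feature(subj_box, obj_box, img_w, img_h):
    """Concatenate the subject-object delta with both normalized boxes."""
    return (box_delta(to_xywh(subj_box), to_xywh(obj_box))
            + normalized_coords(subj_box, img_w, img_h)
            + normalized_coords(obj_box, img_w, img_h))


# Hypothetical boxes in a 400x300 image.
subj = (100, 50, 200, 250)
obj = (120, 200, 220, 300)
feat = spatial_feature(subj, obj, img_w=400, img_h=300)
```

The resulting vector has 4 + 5 + 5 = 14 entries and depends only on box geometry, so it stays informative even when the two objects are far apart in the image.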
2.4 Visual Module
The visual module is useful mainly for three reasons: 1) it accounts for all other types of relationships that spatial features can hardly predict, e.g., interactions such as “man play guitar” and “woman wear handbag”; 2) it solves relationship reference problems krishna2018referring , i.e., when there are multiple subjects or objects that belong to the same category, we need to know which subject is related to which object; 3) for some specific interactions, e.g., “throw”, “eat”, “ride”, the visual appearance of the subject or object alone is very informative about the predicate. With these motivations, we feed subject, predicate and object ROIs into the backbone and take the feature vectors from its last fc layer as our visual features. We then concatenate these three features and feed them into 2 additional randomly initialized fc layers, followed by an extra fc layer, to get a logit, i.e., an unnormalized score. We also add one fc layer on top of the subject feature and another fc layer on top of the object feature to get two scores. These two scores are the predictions made solely from the subject/object feature, according to the third reason mentioned above.
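The layer arrangement of the visual module can be sketched in plain Python as follows. This is a toy illustration only: the feature and hidden sizes, the Gaussian initialization, and the ReLU non-linearity between the two fc layers are assumptions not stated in the text, and the helper names (`make_fc`, `fc`, `visual_scores`) are hypothetical.

```python
import random


def make_fc(n_in, n_out, rng):
    """Randomly initialized fully connected layer as (weights, biases)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fc(layer, x, relu=False):
    """Apply a fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def visual_scores(v_subj, v_pred, v_obj, layers):
    """Return joint, subject-only and object-only predicate logits."""
    joint = v_subj + v_pred + v_obj              # feature concatenation
    hidden = fc(layers["fc1"], joint, relu=True)  # assumed ReLU
    hidden = fc(layers["fc2"], hidden, relu=True)
    joint_logits = fc(layers["out"], hidden)
    subj_logits = fc(layers["subj"], v_subj)      # direct subject prediction
    obj_logits = fc(layers["obj"], v_obj)         # direct object prediction
    return joint_logits, subj_logits, obj_logits


rng = random.Random(0)
feat_dim, hidden_dim, num_predicates = 8, 16, 5   # assumed toy sizes
layers = {
    "fc1": make_fc(3 * feat_dim, hidden_dim, rng),
    "fc2": make_fc(hidden_dim, hidden_dim, rng),
    "out": make_fc(hidden_dim, num_predicates, rng),
    "subj": make_fc(feat_dim, num_predicates, rng),
    "obj": make_fc(feat_dim, num_predicates, rng),
}
v_s = [rng.random() for _ in range(feat_dim)]
v_p = [rng.random() for _ in range(feat_dim)]
v_o = [rng.random() for _ in range(feat_dim)]
joint, subj_only, obj_only = visual_scores(v_s, v_p, v_o, layers)
```

Each of the three outputs holds one unnormalized score per predicate, so they can be added to the scores of the other branches before normalization.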
2.5 The “is” Relationship
In the OpenImages challenge, “⟨object⟩ is ⟨attribute⟩” is also considered a relationship, in which only one object is involved. We handle this sub-task with a completely separate, single-branch, Fast-RCNN-based model. We use the same object detector to get proposals for this model; then, for each proposal, the model produces a probability distribution over all attributes with the Fast-RCNN pipeline.
We present quantitative and qualitative results on OpenImages (OI), together with an ablation study on each component of our model. We also report results on the Visual Genome (VG) and Visual Relationship Detection (VRD) datasets, compared with strong previous methods.
OI: In Table 1 we show the competition results on both the public and private leaderboards. The score is computed as a weighted average of three metrics: recall of the top 50 predictions (R@50), mean average precision of relationships (mAP_rel), and mean average precision of phrases (mAP_phr). Their weights are 0.2, 0.4 and 0.4, respectively. Our model surpasses the 2nd place by a relative 15% on the public dataset and a relative 20% on the private dataset.
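The weighted score is simple arithmetic, sketched below. The function name `challenge_score` and the metric values are hypothetical; only the weights 0.2, 0.4 and 0.4 come from the text.

```python
def challenge_score(recall_at_50, map_rel, map_phr,
                    weights=(0.2, 0.4, 0.4)):
    """Weighted average of R@50, mAP_rel and mAP_phr."""
    w_recall, w_rel, w_phr = weights
    return w_recall * recall_at_50 + w_rel * map_rel + w_phr * map_phr


# Illustrative metric values, not results from the paper.
score = challenge_score(recall_at_50=0.5, map_rel=0.25, map_phr=0.25)
```

Since the two mAP terms carry 80% of the weight, gains in average precision affect the final score four times as much as the same absolute gain in R@50.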
VG: We use the same evaluation metrics as zellers2018neural , which define three modes: 1) Predicate Classification: predict predicate labels given ground truth subject and object boxes and labels; 2) Scene Graph Classification: predict subject, object and predicate labels given ground truth subject and object boxes; 3) Scene Graph Detection: predict all three labels and the two boxes. Recall at the top 20, 50 and 100 predictions is used as the measurement.
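Recall over the top K predictions can be sketched as below. This is a simplified illustration: triples are matched by exact equality of labels, whereas the actual protocol in the detection mode also requires box overlap with the ground truth; the function name and toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the top-k predictions.

    predictions: list of (score, triple) pairs for one image.
    ground_truth: set of triples for the same image.
    """
    if not ground_truth:
        return 0.0
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triple for _, triple in top if triple in ground_truth}
    return len(hits) / len(ground_truth)


# Toy example (illustrative only).
gt = {("person", "ride", "horse"),
      ("person", "wear", "hat"),
      ("horse", "on", "grass")}
preds = [
    (0.9, ("person", "ride", "horse")),
    (0.8, ("person", "feed", "horse")),
    (0.7, ("person", "wear", "hat")),
    (0.1, ("horse", "on", "grass")),
]
r2 = recall_at_k(preds, gt, k=2)
r3 = recall_at_k(preds, gt, k=3)
r4 = recall_at_k(preds, gt, k=4)
```

In this toy case recall rises from one third at K=2 to two thirds at K=3 and reaches 1.0 at K=4, which shows why results are reported at several values of K.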
VRD: We compare with state-of-the-art methods on the VRD dataset in Table 4. We use the metrics presented in yu17iccv . Note that there is a variable $k$ in this metric, which is the number of relation candidates when selecting the top 50/100. Since not all previous methods specified $k$ in their evaluation, we first report performance in the “free $k$” column, treating $k$ as a hyper-parameter that can be cross-validated. For methods where $k$ is reported for 1 or more values, this column reports the performance using the best $k$. We then list all available results with specific $k$ in the right two columns.
Visualization Results: In Figure 2 we show convolutional feature maps from the two backbones described in Figure 1, given an image with the ground-truth relationship “person holds microphone”. The object detector clearly focuses mostly on the contour of the person, while the predicate branch learns to capture the most informative region for “holds”, i.e., the intersection of the microphone and the fingers holding it. We believe this is the most critical reason why our model performs well.
Ablation Study: We report evaluation results on the validation set for four models with the following settings: 1) baseline: the semantic module only; 2) the semantic module plus the visual module without the direct predictions from subject/object; 3) the semantic module plus the complete visual module; 4) our complete model.
Qualitative Results on OI: We show several example outputs of our model. We can see from Figure 3 that we are able to correctly resolve relationship references, i.e., when there are multiple people playing multiple guitars, our model accurately points to the truly related pairs. Our model is also able to handle potentially confusing cases, e.g., in the rightmost image, the person is holding the microphone but not playing the drum, though from a coarse viewpoint it looks like his hand is touching the drum.
- (1) S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
- (2) B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
- (3) G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. In CVPR, 2018.
- (4) J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
- (5) I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
- (6) R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- (7) R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
- (8) Y. Li, W. Ouyang, and X. Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. In CVPR, 2017.
- (9) X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.
- (10) C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
- (11) J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.
- (12) A. Newell and J. Deng. Pixels to graphs by associative embedding. In Advances in neural information processing systems, pages 2171–2180, 2017.
- (13) J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
- (14) B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
- (15) S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- (16) D. Xu, Y. Zhu, C. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Computer Vision and Pattern Recognition (CVPR), 2017.
- (17) D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
- (18) X. Yang, H. Zhang, and J. Cai. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In The European Conference on Computer Vision (ECCV), September 2018.
- (19) T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
- (20) G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In The European Conference on Computer Vision (ECCV), September 2018.
- (21) R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In The IEEE International Conference on Computer Vision (ICCV), 2017.
- (22) R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context.
- (23) H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. arXiv preprint arXiv:1702.08319, 2017.
- (24) H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2017.
- (25) J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal. Relationship proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5678–5686, 2017.
- (26) J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny. Large-scale visual relationship understanding. In AAAI, 2019.
- (27) B. Zhuang, L. Liu, C. Shen, and I. Reid. Towards context-aware interaction recognition for visual relationship detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.