Image understanding based on individual instances, e.g., object detection, classification, and segmentation, has witnessed significant advancement in the past decade. A natural image usually consists of multiple instances in a scene, most of which interact in certain ways. Going a step further, deeper image understanding requires a holistic view that models the interactions between individual objects. Visual relationships [14, 4, 25, 30, 32], which encode the interplay between instances, are an indispensable factor for high-level image understanding tasks such as image retrieval, captioning, and visual question answering. In the existing literature, visual relationships are mostly represented as a scene graph, as shown in Figure 1, with each relationship intuitively expressed as a triplet.
As in most deep learning research, image datasets serve as the backbone of visual relation understanding. To date, the Visual Genome (VG) dataset provides the largest set of relationship combinations, with nearly 21 pairwise relationships per image. It offers abundant data for studying relationships, and given its diverse annotations and large number of images, most works on relationships choose VG as their study target. However, the relationships in VG contain much noise and duplication, since they are extracted from image captions. When utilizing the VG dataset, pre-processing it and constructing a high-quality subset for different tasks is necessary. Specifically, VG150 (a split of VG constructed from the top 150 object categories and 50 relation categories in Visual Genome), as the most widely used data split, is used for relation detection [33, 13] and scene graph generation [30, 25]. A snapshot of a scene graph generated from VG150 is shown in Figure 2 (a).
However, VG150 is constructed according to the frequency of relationship labels: only high-frequency relationships are kept, and semantic information is not taken into account. We found that the relationship data in VG150 has a severe defect, which causes relation representation to rely heavily on non-visual information, such as dataset bias and the language priors of labels, rather than inferring semantic knowledge from images.
As shown in Figure 3, relationships like ‘‘on", ‘‘in", ‘‘of", and ‘‘at" take the dominating majority in the VG150 dataset. However, these relationships can easily be inferred merely from bounding box locations, so visual semantic information becomes unnecessary for prediction. Furthermore, if relationships are determined by instance locations, the relation representation task degrades into single-instance detection. Moreover, some relationships like ‘‘wear", ‘‘ride", and ‘‘has" can easily be estimated from language priors or statistical measures of the dataset. In VG150, as shown in Figure 4, 95.78% of the labeled relationships between the subject ‘‘man" and the object ‘‘nose" are ‘‘has". Similarly, with ‘‘man" as subject and ‘‘jacket" as object, 69.32% of the labeled relationships are ‘‘wearing". This statistical regularity brings certainty and determinism to the task: the lack of uncertainty and diversity in the data leaves methods stuck on statistical bias and fallen into the frequency symptom.
The aforementioned relationships have very limited relevance to the visual semantic information in images, and modeling them from visual information is beating a dead horse. Research on visual scene understanding has long suffered from the bias and interference introduced by these relationships. With these data defects, two adverse effects afflict relationship research:
Sensitive to the performance of object detection. Given ground-truth bounding boxes and instance categories, relationship prediction reaches 98.4% in R@100 (R@50 and R@100 count the number of correct relationships among the top-50 and top-100 most confident relationship predictions, respectively). In VG150, the performance of relation representation usually improves significantly when ground-truth instance locations and classes are introduced [30, 25]. Due to the negative influence of visually-irrelevant relationships, many relation labels are predictable from accurate object localization, which turns the optimization of scene graph generation into learning more accurate detection results.
Highly predictable by statistical priors. In scene graph generation, previous work points out that reasonable results are attainable on current scene graph generation tasks by frequency analysis alone. The frequency-counting baseline even beats several earlier learnable methods, which means many relation labels can be predicted from visually-irrelevant factors. Such high predictability from statistical regularity is contrary to the original intention of understanding image semantics.
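As a concrete illustration, such a frequency baseline can be sketched in a few lines. The triplets below are toy data standing in for VG150 annotations (e.g., most ‘‘man"--‘‘nose" pairs are labeled ‘‘has"); the function names are hypothetical:

```python
from collections import Counter, defaultdict

# Toy (subject, predicate, object) annotations mimicking VG150's skew:
# the "man"-"nose" pair is almost always labeled "has".
triplets = [
    ("man", "has", "nose"), ("man", "has", "nose"), ("man", "has", "nose"),
    ("man", "wearing", "jacket"), ("man", "wearing", "jacket"),
    ("man", "in", "jacket"),
]

# Count predicate frequencies for every (subject, object) class pair.
stats = defaultdict(Counter)
for s, p, o in triplets:
    stats[(s, o)][p] += 1

def predict(subject, obj):
    """Predict the most frequent predicate for a class pair -- no image needed."""
    return stats[(subject, obj)].most_common(1)[0][0]

print(predict("man", "nose"))    # -> "has"
print(predict("man", "jacket"))  # -> "wearing"
```

A model of this kind never looks at pixels, yet on a biased split it competes with learned methods, which is exactly the problem described above.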
All these facts reveal that current relationship studies are stuck and that the relation information in images has not been fully explored.
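For reference, the R@K recall metric used in these comparisons can be sketched as follows; the triplets and names are toy data, not from the actual dataset:

```python
import numpy as np

def recall_at_k(pred_triplets, pred_scores, gt_triplets, k):
    """Recall@K: fraction of ground-truth triplets recovered among the
    top-k most confident predicted (subject, predicate, object) triplets."""
    order = np.argsort(pred_scores)[::-1][:k]          # rank by confidence
    top_k = {pred_triplets[i] for i in order}
    hits = sum(1 for t in gt_triplets if t in top_k)   # matched ground truths
    return hits / max(len(gt_triplets), 1)

# Toy example: 2 of the 3 ground-truth triplets appear in the top-2 predictions.
preds  = [("man", "has", "nose"), ("man", "wearing", "jacket"), ("dog", "on", "bed")]
scores = [0.9, 0.8, 0.1]
gts    = [("man", "has", "nose"), ("man", "wearing", "jacket"), ("cat", "on", "mat")]
print(recall_at_k(preds, scores, gts, k=2))
```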
To avoid the above defects in relation data and construct a better dataset, we propose a novel method to automatically discriminate visually-relevant relationships, and we construct a new scene graph dataset named Visually-relevant Relationships in Visual Genome (VrR-VG; available at https://vrrvg.github.io/). Figure 2 (b) shows a scene graph snapshot from VrR-VG.
VrR-VG is more balanced and contains more visually-relevant relationships, as shown in Figure 5. To demonstrate the difficulty of VrR-VG, we report the performance of several scene graph generation methods. The experiments show significant performance decreases on all metrics when adopting our data split. We also show that frequency analysis alone no longer works on VrR-VG. Moreover, features trained on VrR-VG yield convincing results in text-image multimodal applications such as Visual Question Answering (VQA), which also indicates that our dataset contains more complex semantic information than VG150. All of these results further confirm the defects of previous scene graph datasets and expose the insufficiency of current relationship research.
The main contributions of this paper are summarized as follows:
A novel method for selecting visually-relevant and valuable relation labels is proposed. Both positional relationships and statistically biased relationships are detected by this method, while complex and visually-relevant relationships are retained. Based on the proposed method, a new scene graph dataset, VrR-VG, is constructed. Our dataset has a scale comparable to previous datasets but includes more valuable, visually-relevant relationships.
A feature embedding method trained with the location, category, and attributes of each instance, together with the relationships among instances. With the proposed method, more relational semantic information is embedded into the features.
The performance improvements on the VQA task, and the more complex predicates appearing in captioning results, demonstrate that VrR-VG contains more valuable visual relationships than previous scene graph datasets, and that the visual representations learned on our dataset have a stronger ability for semantic expression.
2 Related Work
| Dataset | # Objects | # Relations | # Images |
| --- | --- | --- | --- |
| Visual Phrase | 8 | 9 | 2,769 |
| Scene Graph | 266 | 68 | 5,000 |
| Open Image | 57 | 10 | 1,743,042 |
| Visual Genome | 33,877 | 40,480 | 108,077 |
| VG150 [25, 30] | 150 | 50 | 87,670 |
We list relationship-related datasets in Table 1. The Visual Phrase dataset focused on relation phrase recognition and detection; a relation triplet is annotated as a whole target. It contained 8 object categories from Pascal VOC2008 and 17 relation phrases covering 9 different relationships. The Scene Graph dataset mainly explored image retrieval via scene graphs. The VRD dataset was intended to benchmark the relationship detection task and contains 37,993 relation triplets with 6,672 unique triplets. Open Images provided the largest number of images; it is a brilliant resource for object detection and also presents a challenging relationship detection task. PIC is the newest of the listed datasets and proposed an interesting task combining instance segmentation and relationships: instead of taking detection bounding boxes and classes as supervision, instance segmentation masks are the main goal. Moreover, the UnRel dataset focuses on rare relationships and presents a new weakly-supervised relation representation task. However, since UnRel's main target is representing uncommon relationships, its annotations are limited: there are only 1,071 images, which are hard to adapt to a supervised task.
Visual Genome (VG) has the largest amount of relation data, with the most diverse object and relation categories among all listed datasets. As shown in Table 1, VG contains millions of object labels and relation triplets. However, its relations are extracted from image captions and contain many noisy and duplicate entries. VG150 was therefore constructed by processing VG and keeping only high-frequency relationships. However, as mentioned above, most high-frequency relationships are visually irrelevant.
Numerous deep learning methods have been proposed for image representation learning. These methods address two aspects of image understanding: representation of a single instance and of multiple instances. For single instances, GoogLeNet, ResNet, Inception, ResNeXt, etc. are trained on the ImageNet dataset and focus on single-instance classification. Since the supervision labels are instance categories, these methods tend to give a holistic representation of images and produce features attending to the single most salient instance. However, as more than one instance exists in an image, focusing on one salient instance is not enough to represent the scene. To explore multiple instances, detection methods provide some inspirational ideas. Jin et al. adapt selective search to produce salient region proposals. A similar idea appears in RCNN, in which the network first obtains many region proposals and then works out a detection result for every instance. Faster-RCNN further improved the region-proposal idea and provides a faster and more elegant way to limit region proposals. Building on Faster-RCNN's region proposals, Anderson et al. proposed a bottom-up and top-down attention method to represent images; they utilize the locations, categories, and attributes of instances to learn the representation and obtain improvements on several semantic tasks. In our work, we go deeper into multiple-instance representation by adding inter-instance information to the features: the isolated instance information (locations, categories, attributes), together with the relationships among instances, is fully utilized in representation learning.
3 Visually-relevant Relationships Dataset
In this section, we introduce a novel visually-relevant relationship discriminator (VD). VD is a simple fully-connected network that aims to recognize relation labels directly from entity classes and bounding boxes. Since no visual information is fed to VD, relationships with high prediction accuracy tend to be regarded as visually-irrelevant. After filtering out these visually-irrelevant relationships and reducing duplicate relationships by hierarchical clustering, we construct a new dataset from the original VG dataset, named the Visually-relevant Relationships Dataset (VrR-VG).
3.1 Visually-relevant Relationship Discriminator
To distinguish visually-irrelevant relationships, we propose the hypothesis that if a relationship is predictable from any information other than visual information, the relationship is visually-irrelevant. In our work, a simple visually-relevant relationship discriminator (VD) is proposed for selecting relationships. To prevent overfitting caused by a large number of parameters, the network structure follows the guideline ‘‘tinier is better’’. Our VD aims to recognize relationships without being given any kind of visual information. Specifically, the inputs are the word vectors of the instances' categories, learned from a natural language corpus, and the coordinates of the corresponding bounding boxes. GloVe is adopted for the word vectors.
Each bounding box of an instance in the image is defined by a four-tuple (x, y, w, h) that specifies its top-left corner and its width and height. We denote the position embeddings for the subject and object as p_s and p_o respectively, where p_s = (x_s, y_s, w_s, h_s) and p_o = (x_o, y_o, w_o, h_o). The bounding boxes of a given subject and object are embedded into a joint vector as follows:

p_so = [Δx, Δy, w_s, h_s, w_o, h_o, c_s, c_o],  with Δx = x_s − x_o, Δy = y_s − y_o,

where (Δx, Δy) are the box offsets computed as the difference between the subject and object coordinates, (w_s, h_s) and (w_o, h_o) are the widths and heights of the subject and object bounding boxes respectively, and c_s = (x_s + w_s/2, y_s + h_s/2) and c_o = (x_o + w_o/2, y_o + h_o/2) are the centers of the corresponding boxes. These position embedding values provide positional information to the network.
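A minimal sketch of this position embedding, assuming the (x, y, w, h) box convention above; the paper's exact normalization may differ:

```python
def position_embedding(sub_box, obj_box):
    """Joint position embedding of a subject/object box pair.
    Boxes are (x, y, w, h) with (x, y) the top-left corner.
    Components: offsets, both box sizes, and both box centers."""
    xs, ys, ws, hs = sub_box
    xo, yo, wo, ho = obj_box
    dx, dy = xs - xo, ys - yo                    # coordinate offsets
    cs = (xs + ws / 2.0, ys + hs / 2.0)          # subject center
    co = (xo + wo / 2.0, yo + ho / 2.0)          # object center
    return [dx, dy, ws, hs, wo, ho, cs[0], cs[1], co[0], co[1]]

print(position_embedding((10, 20, 40, 60), (30, 20, 20, 20)))
```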
The network details are given in Figure 6, where v_s and v_o are the word vectors of the subject and object categories, and the W_i are the learnable weights of the VD network. The word vectors of the subject and object are processed by a simple fully-connected layer; the output features are then concatenated with the position embeddings p_s and p_o. Relationships whose prediction accuracy exceeds a threshold θ are regarded as visually-irrelevant and discarded, while the remaining relationships are reserved for generating the dataset. In this paper, we set θ to 50% as a trade-off between dataset scale and visually-relevant semantic quality.
The VD merely contains three fully-connected layers, but this is already sufficient to predict most visually-irrelevant relationships like ‘‘wear", ‘‘on", ‘‘above", etc.
It is worth noting that more than 54% of the relation labels in VG150 can be predicted with at least 50% accuracy by such a crude neural network without any visual information.
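A minimal numpy sketch of such a discriminator, with illustrative layer sizes (not the paper's); the GloVe vectors are replaced by random stand-ins here:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class VD:
    """Sketch of the visually-relevant relationship discriminator:
    three fully-connected layers over word vectors plus a position
    embedding -- no visual features at all. Sizes are illustrative."""
    def __init__(self, word_dim=50, pos_dim=10, hidden=64, n_rel=117):
        in_dim = 2 * word_dim + pos_dim
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, hidden))
        self.w3 = rng.normal(0, 0.1, (hidden, n_rel))

    def forward(self, sub_vec, obj_vec, pos_emb):
        # Concatenate subject/object word vectors with the position embedding.
        x = np.concatenate([sub_vec, obj_vec, pos_emb])
        return relu(relu(x @ self.w1) @ self.w2) @ self.w3  # relation logits

vd = VD()
logits = vd.forward(rng.normal(size=50), rng.normal(size=50), rng.normal(size=10))
print(logits.shape)  # one logit per relation label
```

Relations that such a network classifies accurately are, by the hypothesis above, visually-irrelevant and can be filtered out.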
3.2 Dataset Construction
We pre-process VG and extract the top 1600 objects and 500 relationships to generate a basic data split. The raw relation labels in the scene graphs contain many duplications, such as ‘‘wears" and ‘‘is wearing a", or ‘‘next" and ‘‘next to". Such labels may confuse the network, because all of them are correct for the same subject and object combination. We represent the labels by GloVe word vectors and then cluster the vectors hierarchically. This simple operation reduces the label categories from 500 to 180, which is more diverse than the former 50 relations in VG150 while reserving sufficient semantic-level information. Then, to exclude visually-irrelevant relationships, the VD network is trained and evaluated on the 180 clustered labels. Finally, we obtain 117 relation labels as the VrR-VG relationships.
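The label-merging step can be sketched as follows. As a stand-in for hierarchical clustering over GloVe vectors, this toy version greedily merges labels whose word vectors are sufficiently similar; the vectors and threshold are illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_labels(labels, vectors, threshold=0.9):
    """Greedy single-link grouping of relation labels by word-vector
    similarity -- a toy stand-in for the hierarchical clustering applied
    to the GloVe vectors of the 500 raw relation labels."""
    clusters = []  # each cluster is a list of label indices
    for i, v in enumerate(vectors):
        for c in clusters:
            if any(cosine(v, vectors[j]) >= threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return [[labels[j] for j in c] for c in clusters]

# Toy vectors: "wears" / "is wearing a" nearly collinear, "next to" distinct.
labels = ["wears", "is wearing a", "next to"]
vecs = [np.array([1.0, 0.1]), np.array([0.9, 0.12]), np.array([0.0, 1.0])]
print(cluster_labels(labels, vecs))  # duplicates merge into one cluster
```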
In terms of image amount, the VG150 data split, which was adopted in many relationship representation works [30, 25], contains 87,670 images and 588,586 triplet pairs. VrR-VG has 58,983 images and 23,375 relation pairs, fewer than VG150. However, as shown in Figure 7, the distribution of our dataset is more balanced. Since VG150 is generated according to label frequency, easy and general relationships are inevitably involved in the dataset. Given the data bias of the original Visual Genome, an imbalanced data distribution is inherited when generating the data splits for both VG150 and VrR-VG.
Moreover, labels like ‘‘on", ‘‘of", etc. in VG150 are too trivial to describe entity interactions, and some relationships with the same semantic meaning appear among the top relationships in VG150: e.g., ‘‘wears" and ‘‘wearing" are regarded as two different relationships, accounting for 11.87% and 0.84% of VG150 respectively. Comparatively, our top-12 labels are more significant in the semantic domain. Relationships like ‘‘hanging on" and ‘‘playing with" are hard to estimate without sufficient understanding of the corresponding scenes, so our data split is much more difficult for visual semantic representation. We also show some exemplar scene graphs with randomly sampled images in Figure A.1. Compared with the previous VG150 dataset, the relationships in our dataset are more diverse and contain more semantic information for describing the scene.
4 Informative Visual Representation Learning
As shown in Figure 8, to model the entire visual information in an image, the properties of isolated objects, such as category, position, and attributes, and the interactions among objects, expressed as relationships, are all useful. In our framework, all these properties are utilized as supervision for feature training. We extract single-object proposals with a detector, and then train the model with object categories, locations, attributes, and the relationships among instances.
Specifically, Faster-RCNN with a ResNet-101 backbone is adopted as the instance detector in our framework. We apply non-maximum suppression (NMS) to the region proposals and then select candidate proposals according to an IoU threshold. Then, through a mean-pooling layer, the proposal features are integrated into the same dimensions.
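For completeness, the IoU and greedy NMS steps can be sketched as follows; the boxes, scores, and threshold are illustrative:

```python
def iou(a, b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above `thresh`, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```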
To learn single-instance properties, together with the original detection operations, we also set an attribute classifier to learn instance attributes. The isolated properties are thus learned as follows:

b_i = W_b f_i,   c_i = softmax(W_c f_i),   a_i = softmax(W_a [f_i; c_i]),

where W_b, W_c, and W_a are learnable parameters, [·; ·] is the concatenation operation, f_i is the pooled feature of the i-th proposal, and b_i, c_i, and a_i are the bounding box, class, and attribute predictions for the i-th instance.
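A numpy sketch of these per-instance heads, with illustrative dimensions (the exact head structure and sizes are assumptions, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative dimensions: 2048-d pooled proposal feature,
# 150 classes, 40 attributes.
D, N_CLS, N_ATTR = 2048, 150, 40
W_b = rng.normal(0, 0.01, (D, 4))               # bounding-box head
W_c = rng.normal(0, 0.01, (D, N_CLS))           # class head
W_a = rng.normal(0, 0.01, (D + N_CLS, N_ATTR))  # attribute head on [f; c]

f = rng.normal(size=D)                        # one proposal feature
b = f @ W_b                                   # box prediction
c = softmax(f @ W_c)                          # class distribution
a = softmax(np.concatenate([f, c]) @ W_a)     # attributes from feature + class
print(b.shape, c.shape, a.shape)
```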
Meanwhile, learning the interactions among instances described by relation data plays an important role in high-level semantic tasks, especially reasoning tasks like question answering. Specifically, we obtain the relationships between entities by the following equations:

s_i = W_s f_i,   o_j = W_o f_j,   r_ij = softmax(W_r [s_i; o_j]),

where W_s and W_o are learnable parameters for mapping instances into the relation domain, s_i and o_j are the mapped subject and object features, and r_ij is the relation prediction for proposal instances i and j. In relation training, the proposal features are first mapped into the relation space by a fully-connected layer; we then fuse the mapped features to obtain the relation labels between proposals. All pairwise combinations of proposals participate in feature training. The ground-truth labels are allocated by the anchor settings and detection RoIs. The target labels are all the relations in VrR-VG plus an additional no-relation label.
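A numpy sketch of this pairwise relation head; the fusion by concatenation and all dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes: K proposals, 118 labels (117 relations + "no relation").
D, D_REL, K, N_REL = 2048, 512, 4, 118
W_s = rng.normal(0, 0.01, (D, D_REL))  # subject mapping to relation space
W_o = rng.normal(0, 0.01, (D, D_REL))  # object mapping to relation space
W_r = rng.normal(0, 0.01, (2 * D_REL, N_REL))

F = rng.normal(size=(K, D))            # pooled proposal features
S, O = F @ W_s, F @ W_o
# Fuse every ordered (i, j) pair, i != j, into a relation distribution.
pairs = [(i, j) for i in range(K) for j in range(K) if i != j]
R = softmax(np.stack([np.concatenate([S[i], O[j]]) for i, j in pairs]) @ W_r)
print(R.shape)  # K*(K-1) ordered pairs, one distribution each
```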
During training, the locations, categories, and attributes of single entities, and the relationships between entities, all participate in and supervise feature learning. As a result, the features contain both the information of isolated entities and the interactions among them. We use the final features in the VQA and captioning tasks to evaluate their ability for semantic expression.
| Method | VG150 R@50 (SGDet / SGCls / PredCls) | VrR-VG R@50 (SGDet / SGCls / PredCls) |
| --- | --- | --- |
| Message Passing | 20.7 / 34.6 / 59.3 | 8.46 / 12.1 / 29.7 |
5 Experiments
In this section, we discuss the properties of our data split from two sides: (1) dataset comparison on the traditional scene graph generation task, and (2) dataset quality evaluation by applying the visual representations learned from different datasets to semantic or reasoning tasks, namely visual question answering and image captioning.
5.1 Scene Graph Generation
To evaluate the relationships reserved in our split, we compare the VrR-VG dataset with other relation datasets on the scene graph generation task. Scene graph generation requires learning image semantics and inferring the connections of entities at the semantic level. We mainly evaluate and discuss the latest or most widely used scene graph generation methods, including MSDN, VTransE, Neural-Motifs, and Message-Passing. Additionally, Neural-Motifs and Message-Passing use a pre-trained Faster-RCNN with a VGG backbone as the object detector; the mAP of the detector at 50% IoU is 20.0 on VG150 and 8.2 on our VrR-VG.
As shown in Tables 2, 3, and 4, the performance of these relation representation methods decreases markedly when adopting our data split. With the relationships selected by our method, the relation representation task becomes more difficult and challenging.
We also evaluate the frequency-based method, which uses the pre-trained detector from Neural-Motifs and then predicts relation labels from the statistical label probabilities of the training set, without any image input. As shown in Table 3, our dataset efficiently reduces the impact of language priors and data bias, and the frequency-based method is no longer practicable. On the previous relation dataset VG150, the statistical method achieves results similar to methods that use visual information, without any image input: results decline by only 3.7, 3.4, and 5.3 in R@50 for scene graph detection (SGDet), scene graph classification (SGCls), and predicate classification (PredCls). On our data split, however, the statistical method drops by 14.7, 23.9, and 23.4, far below the models learned from visual information. This indicates that the performance gap between visual and non-visual methods is much larger on our VrR-VG dataset than on the previously widely used VG150. The frequency-based method is effectively inhibited on our dataset, so a model's true ability for semantic understanding can be exposed.
Further, the results of the frequency-based method also reveal that location information can serve as a prior for relationships. In VG150, with ground-truth bounding boxes as inputs, SGCls improves by 8.9 and 6.4 in R@50 and R@100, indicating that entity positions also carry bias and can help models estimate relation labels without visual information. In our data split, however, the performance of the frequency-based method decreases after appending entity locations, which means our data split eliminates the positional shortcut and makes models less sensitive to object detection.
Moreover, the predicate detection task (PredDet), the easiest task in relation representation since paired entities and detection ground truth are given, theoretically represents the maximum achievable performance. As shown in Table 4, although the results decrease on our dataset, methods learned with visual information still achieve convincing PredDet values: both Neural-Motifs and Message-Passing score above 80 and 90 in R@50 and R@100. This indicates that the relationships selected in our dataset are representable, and that with well-designed methods the problems in our split are neither incompatible nor infeasible. Meanwhile, the R@50 and R@100 of the frequency-based method are merely 69.8 and 78.1, so it does not work well under the PredDet metric either, which means methods for our dataset must explore and infer visually-relevant and valuable relationships by relying on the understanding of images.
Comparing R@50 and R@100, the gap between the two metrics grows for all methods. As shown in Table 2, the gaps for Neural-Motifs between R@50 and R@100 are 3.1, 0.7, and 1.9 in SGDet, SGCls, and PredCls on VG150, while on VrR-VG they are 2.6, 2.7, and 5.8. Message-Passing shows the same tendency: a smaller gap in SGDet and larger gaps in SGCls and PredCls. The decrease of the gap in SGDet is remarkable. Since this metric uses only images as input, models must infer relationships entirely from visual information, which corresponds exactly to our visually-relevant relationships, so SGDet values show smaller volatility on our dataset. Furthermore, with more accurate location and category information as input, models have more ways to infer relationships: in the SGCls and PredCls metrics, the ground-truth bounding boxes and classes help models learn the data bias better, so on VG150 the volatility becomes smaller than in SGDet. On our split, by contrast, the relation labels are harder: labels that cannot be estimated in the top-50 rarely appear in the top-100 either, meaning mistakes made in the top-50 ranking cannot be fixed in the top-100 ranking, so larger gaps arise in VrR-VG. These results reveal that our dataset places a higher requirement on models to estimate relationships from visual information and targets relationships with greater semantic value.
5.2 Representation Learning for Semantic Tasks
To evaluate relation quality at the semantic level, we choose the VQA and captioning tasks and use our split to inspect the semantic improvements of the dataset. We also compare our informative visual representation learning method with traditional visual representation learning; the experimental results demonstrate that visually-relevant relationships play an important role in high-level visual understanding tasks. For the implementation of our proposed method, we follow the settings of the previous bottom-up and top-down method for all experiments in this section, and image features are integrated into 2048 dimensions by mean-pooling. Additionally, for an ablation study we introduce a new dataset, VrR-V, generated by excluding the relation data from VrR-VG. We apply our proposed feature learning to VrR-V as well, but with the weight of the relationship prediction loss set to 0.
We applied two widely used VQA methods, MUTAN and MFH, to evaluate the quality of the image features learned from different datasets. Table 5 reports the experimental results on the validation set of the VQA-2.0 dataset. Features trained with our VrR-VG obtain the best performance among all the datasets. We also compared against the dataset used in Bottom-Up attention, which is regarded as the strongest feature representation learning method for VQA.
In detail, with relation data, our complete VrR-VG performs better than both the dataset used in Bottom-Up attention and VrR-V. The results indicate that relation data is useful for the VQA task, especially for highly semantic questions, and demonstrate that our proposed informative visual representation method extracts more useful features from images. Besides, we also apply our proposed feature learning method to the VG150 dataset. Since VG150 contains many visually-irrelevant relationships that, as mentioned, can easily be inferred from data bias, the features learned from VG150 tend to lack the ability to represent complex visual semantics. The higher quality of the relation data energizes the features learned from our dataset and leads to better performance on the open-ended VQA task.
In this part, we apply our constructed dataset VrR-VG, and our proposed feature representation learning method that jointly uses entities, attributes, localizations, and interactions, to the image captioning task. Similar to the VQA experiments, we first generate image features based on VG150, VrR-V, and VrR-VG respectively, and then apply the same caption model to these image features with identical settings for fairness.
The experimental results are shown in Table 6. We report the performance on our split and on VG150 under both the original optimization of the cross-entropy loss and CIDEr optimization for the CIDEr score. Features generated from our data split work better than those from VG150: all caption metrics improve under both optimization schemes. Moreover, comparing with and without relations, our complete VrR-VG performs better than VrR-V, which contains only the object information of VrR-VG. This indicates that the relation information is utilized and useful; the improvement shows the efficacy of our dataset and the power of relationships in semantic tasks.
Specifically, to show the superiority of the relation data in our split, Figure 10 gives examples illustrating the difference between features learned from VG150 and from VrR-VG. For a fair comparison, both caption examples recognize the objects accurately, so we focus on the difference in predicates, which better reflects the influence of relation data. In the caption results, our features tend to produce more diverse predicates and more vivid descriptions: rather than simple predicates like ‘‘on" and ‘‘standing", our features provide more semantic information and help models achieve more complex expressions like ‘‘writing" and ‘‘grazing".
6 Conclusion
In this paper, we design a tiny network as a relation discriminator to select visually-relevant relationships. With the selected relationships, a visually-relevant relationship split of Visual Genome is generated. Compared with the original VG splits, this split contains more valuable relationships and more semantic information, which are hard to estimate merely from statistical bias or detection ground truth. We also propose an informative visual representation learning method that jointly learns image features from entities' labels, localizations, attributes, and inter-entity interactions. The significant improvements on the VQA and image captioning tasks demonstrate that: (1) our constructed dataset VrR-VG has better quality than previous datasets; (2) visual relationship information is helpful for high-level semantic understanding tasks; and (3) our proposed informative visual representation learning method can effectively model different kinds of visual information jointly.
-  picdataset.com. http://picdataset.com/challenge/index/, 2018.
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077--6086, 2018.
-  H. Ben-younes, R. Cadène, M. Cord, and N. Thome. MUTAN: multimodal tucker fusion for visual question answering. CoRR, abs/1705.06676, 2017.
-  B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3298--3308, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
-  M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303--338, 2010.
-  R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770--778, 2016.
-  J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang. Aligning where to see and what to tell: image caption with region-based attention and scene factorization. CoRR, abs/1506.06272, 2015.
-  J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Image retrieval using scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3668--3678, 2015.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. CoRR, abs/1602.07332, 2016.
-  Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and region captions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1270--1279, 2017.
-  X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4408--4417, 2017.
-  C. Lu, R. Krishna, M. S. Bernstein, and F. Li. Visual relationship detection with language priors. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 852--869, 2016.
-  P. Lu, L. Ji, W. Zhang, N. Duan, M. Zhou, and J. Wang. R-VQA: learning visual relation facts with semantic attention for visual question answering. CoRR, abs/1805.09701, 2018.
-  J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1532--1543, 2014.
-  J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137--1149, 2017.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. CoRR, abs/1612.00563, 2016.
-  M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
-  C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
-  J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154--171, 2013.
-  S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.
-  D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3097--3106, 2017.
-  T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 711--727, 2018.
-  T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. CoRR, abs/1611.01646, 2016.
-  Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao. Beyond bilinear: Generalized multi-modal factorized high-order pooling for visual question answering. CoRR, abs/1708.03619, 2017.
-  R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 5831--5840, 2018.
-  H. Zhang, Z. Kyaw, S. Chang, and T. Chua. Visual translation embedding network for visual relation detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3107--3115, 2017.
-  J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. M. Elgammal, and M. Elhoseiny. Large-scale visual relationship understanding. CoRR, abs/1804.10660, 2018.
-  L. Zhou, J. Zhao, J. Li, L. Yuan, and J. Feng. Object relation detection based on one-shot learning. CoRR, abs/1807.05857, 2018.
Appendix A
A.1 Scene Graph Data Comparison
Some scene graph examples are given here. VrR-VG contains only visually-relevant relationships. As shown in the figures, scene graphs generated from VrR-VG tend to have more complex semantic expressions, which offer more diverse and sufficient information for multiple tasks.
A.2 Image Captioning Results Comparison
In this section, as shown in Fig. 10, we provide more caption results to show the difference between features learned from VG150 and from our split. With more visually-relevant relationships, the predicates in the captions become more diverse.