Large-Scale Visual Relationship Understanding

04/27/2018 ∙ by Ji Zhang, et al.

Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. We learn both a visual and a semantic module that map features from the two modalities into a shared space, where matched pairs of features have to discriminate against those unmatched, but also maintain close distances to semantically similar ones. Benefiting from that, our model can achieve superior performance even when the visual entity categories scale up to more than 80,000, with an extremely skewed class distribution. We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that comprises 53,000+ objects and 29,000+ relations, a scale at which no previous work has been evaluated. We show the superiority of our model over carefully designed baselines on Visual Genome, as well as competitive performance on the much smaller VRD dataset.




1 Introduction

Figure 1: Ground-truth and top-1 predicted relationships by our approach for an image in the Visual Genome test set. Bounding boxes are colored in pairs and their corresponding relationships are listed in the same colors. The number beside each relationship corresponds to the number of times that triplet was seen in the training set. Our model is able to predict both commonly seen relationships such as “man, wearing, glasses” and rarely seen ones such as “dog, next to, woman”.

Scale matters. In the real world, people tend to describe visual entities with an open vocabulary; e.g., the raw ImageNet [6] dataset has 21,841 synsets covering a wide range of object classes. The vocabulary is even more open when it comes to relationships, since the combinations of <subject, relation, object> are orders of magnitude more numerous than objects [22, 28, 37]. Moreover, the long-tailed distribution of objects is an obstacle for a model to learn all classes sufficiently, and this challenge is exacerbated in relationship detection because the subject, the object, or the relation could each be infrequent, or their triple might be jointly infrequent. Figure 1 shows an example from the Visual Genome dataset, which contains commonly seen relationships (e.g., “man, wearing, glasses”) along with uncommon ones (e.g., “dog, next to, woman”).

The second challenge lies in the fact that object categories are often semantically associated [6, 17, 5], and such connections can be even more subtle for relationships. For example, an image of “person ride horse” could look like one of “person ride elephant”, since both belong to the kind of relationships where a person is riding an animal, but “person carry bike” would look very different from “person ride bike” even though they have the same subject and object. It is critical for a model to be able to leverage such semantic connections.

In this work, we aim to study relationship detection at an unprecedented scale where the total number of visual entities is more than 80,000. We carefully develop a scalable approach that is semantically guided, with a loss that enables discriminative learning in such a large-scale setting. We use a continuous output space for objects and relations instead of discrete labels, which allows knowledge to be easily transferred from frequent classes to infrequent ones; this ability enables accurate recognition of a large number of rarely seen relationships.

Meanwhile, we also show that in order to enable knowledge transfer in the continuous space while preserving discriminative power, the well-known multi-class logistic loss (i.e., the loss used in softmax models) fails at the former and the triplet loss [16] (widely used when learning in a continuous space) fails at the latter, while our triplet-softmax loss handles both. On the Visual Genome (VG) dataset, we design an intuitively straightforward baseline and show our superior performance over it. We demonstrate our model’s advantage over the triplet loss on the whole dataset and over the multi-class logistic loss on the long-tail data. Our model also achieves very competitive performance on the much smaller Visual Relationship Detection (VRD) dataset.


Figure 2: (a) Overview of the proposed approach. L_s, L_r and L_o denote the losses of the subject, relation and object branches. Orange, purple and blue represent subject, relation and object, respectively. Grey rectangles are fully connected layers, which are followed by ReLU activations except for the last layer of each branch. We share the layer weights of the subject and object branches.

2 Related Work

Our work is at the intersection between semantic-embedding models and visual relationship detection and hence we discuss both angles.

Visual Relationship Detection A large number of visual relationship detection approaches have emerged during the last couple of years. Almost all of them are based on a small vocabulary, e.g., 100 object and 70 relation categories from the VRD dataset[22], or a subset of VG with the most frequent object and relation categories (e.g., 200 object and 100 relation categories from [35]).

In one of the earliest works, Lu et al. [22] utilize the object detection output of an R-CNN detector and leverage language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship. Very recently, Zhuang et al. [39] use language representations of the subject and object as “context” to derive a better classification result for the relation. However, similar to Lu et al. [22], their language representations are pre-trained. Unlike these approaches, we fine-tune subject and object representations jointly and employ the interaction between branches also at an earlier stage, before classification.

In [34], the authors employ knowledge distillation from a large Wikipedia-based corpus and achieve state-of-the-art results on the VRD [22] dataset. In ViP-CNN [19], the authors pose the problem as a classification task over a limited set of classes and therefore cannot scale to the scenarios we cope with. In our model we exploit training-set concept co-occurrences at the relationship level to model such knowledge. Our approach directly targets the large category scale and is able to utilize semantic associations to compensate for infrequent classes, while at the same time achieving competitive performance on the smaller and constrained VRD [22] dataset.

Very recent approaches like [38, 28] target an open vocabulary for scene parsing and visual relationship detection, respectively. In [28], the related work closest to ours, the authors learn a CCA model on top of different combinations of the subject, object and union regions and train a Rank SVM. They however consider each relationship triplet as a class and learn it as a whole entity, and thus cannot scale to our setting. Our approach embeds the three components of a relationship separately into independent semantic spaces for objects and relations, but implicitly learns connections between them via visual feature fusion and semantic meaning preservation in the embedding space.

Semantically Guided Visual Recognition. Another parallel category of vision-and-language tasks is known as zero-shot learning. In [11], [24] and [31], word embedding language models (e.g., [23]) were adopted to represent class names as vectors and hence allow zero-shot recognition. For fine-grained objects like birds and flowers, several works adopted Wikipedia articles to guide zero-shot recognition, since they are easily collected and provide more signal for recognition (e.g., [9, 8, 1, 7]). However, these methods are not designed with the capability of locating objects or modeling interacting objects, as visual relations require. Several approaches have been proposed to model visual-semantic embeddings in the context of the image-sentence similarity task (e.g., [16, 32, 10, 33, 12]). Most of them focus on learning semantic connections between the two modalities, which we not only aim to achieve, but in a manner that does not sacrifice discriminative capability, since our task is detection rather than similarity-based retrieval. In contrast to sentences, visual relations also have a <subject, relation, object> structure, and we show in our results that proper design of the visual-semantic embedding architecture and loss is critical for good performance. We compare our approach against most of the aforementioned related works on common benchmarks in Section 4.

Language Representation. Representing language in an embedding space that captures semantics has been extensively studied in the NLP community since the seminal word2vec approach of Mikolov et al. [23]. Such embeddings are based on co-occurrences of words in a generic corpus like Wikipedia and have been shown not only to capture semantics but also to encode semantic analogies in the embedding space [25]. Chollet [3] exploited multi-label annotations, i.e., label co-occurrences, to learn an embedding space that captures the likelihood that any two labels co-occur in the same picture. Yu et al. [34] model relationship probabilities via training-set statistics; due to sparsity for many relations, they also incorporate external (Wikipedia) sources for language modeling. Inspired by such approaches, and exploiting the rich annotations of the Visual Genome dataset [17], we explore and present multiple ways of creating meaningful embeddings that capture both linguistic semantics and relationship-level co-occurrence statistics.

Note: in this paper we use “relation” to refer to what is also known as “predicate” in previous works, and “relationship” or “relationship triplet” to refer to a <subject, relation, object> tuple.

3 Method

The task of relationship detection naturally requires a model to have discriminative power among a set of categories. It is well known that softmax with a multi-class logistic loss is the best choice when there are not too many categories and they have little association with each other [18, 30, 14]. However, in the real-world setting these two assumptions are often not true, since people might describe things in unlimited ways with potentially similar meanings. In such scenarios, it is critical for a model to preserve semantic similarities without sacrificing discriminative power. To achieve that, our model adopts a two-way pipeline that maps images and labels to a shared embedding space. We use a new loss that is better than both the softmax-based and the triplet loss in the large-vocabulary setting. During testing, recognition is done by nearest neighbor search in the embedding space and scales linearly with the size of the vocabulary.

3.1 Visual Module

The design logic of our visual module is that a relation exists when its subject and object exist, but not vice versa. Namely, relation recognition is conditioned on subject and object, but object recognition is independent of relations. The main reason is that we want to learn embeddings for subject and object in a semantic space separate from the relation space. That is, we want to learn a mapping from the visual feature space (which is shared among subject/object and relation) to the two separate semantic embedding spaces (for objects and relations). Therefore, involving relation features in the subject/object embeddings would risk entangling the two spaces. Following this logic, as shown in Figure 2, an image is fed into a CNN (the convolutional layers of VGG16) to get a global feature map, from which the subject, relation and object features are ROI-pooled using the corresponding regions; each branch is then followed by two fully connected layers that output three intermediate hidden features. For the subject branch, we add another fully connected layer to get the visual embedding, and similarly for the object branch. For the relation branch, we apply two-level feature fusion: we first concatenate the three hidden features and feed them to a fully connected layer to get a higher-level hidden feature, then we concatenate the subject and object embeddings with it and feed the result to two fully connected layers to get the relation embedding.

There are several things worth highlighting. First, the two fully connected layers in each branch are the counterparts of the fc6 and fc7 layers in VGG16, and thus both are necessary in order to preserve the capacity of VGG16. Second, both concatenations in the relation branch are necessary and sufficient, as shown in Section 4.3. The first concatenation integrates relatively low-level visual features intrinsically captured by the feature extractor, while the second fuses the embeddings of subject and object with the higher-level hidden feature; it can be interpreted as an integration of three higher-level features. Third, unlike most previous works that concatenate spatial and visual features, we found that such concatenation barely helps in our setting, because the spatial layouts of relationships are much more diverse due to the large scale of both instances and classes, which is also observed in Zhang et al. [37].
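To make the two-level fusion concrete, the following NumPy sketch traces tensor shapes through the three branches. The layer sizes, the use of a single fully connected layer per stage, and the random weight initialization are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b, relu=True):
    """Fully connected layer; ReLU on all but the last layer of each branch."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

d_vis, d_hid, d_emb = 4096, 1024, 300  # illustrative sizes, not from the paper

# ROI-pooled features for subject, relation (union box) and object.
x_s = rng.normal(size=(1, d_vis))
x_r = rng.normal(size=(1, d_vis))
x_o = rng.normal(size=(1, d_vis))

# Subject and object branches share weights; relation weights are separate.
w1_so, b1_so = rng.normal(size=(d_vis, d_hid)) * 0.01, np.zeros(d_hid)
w_emb, b_emb = rng.normal(size=(d_hid, d_emb)) * 0.01, np.zeros(d_emb)

h_s = fc(x_s, w1_so, b1_so)
h_o = fc(x_o, w1_so, b1_so)                 # same weights as the subject branch
e_s = fc(h_s, w_emb, b_emb, relu=False)     # subject visual embedding
e_o = fc(h_o, w_emb, b_emb, relu=False)     # object visual embedding

# Relation branch: first-level fusion of the three hidden features...
w1_r, b1_r = rng.normal(size=(d_vis, d_hid)) * 0.01, np.zeros(d_hid)
h_r = fc(x_r, w1_r, b1_r)
w_f1, b_f1 = rng.normal(size=(3 * d_hid, d_hid)) * 0.01, np.zeros(d_hid)
h_fused = fc(np.concatenate([h_s, h_r, h_o], axis=1), w_f1, b_f1)

# ...then second-level fusion with the subject/object embeddings.
w_f2, b_f2 = rng.normal(size=(2 * d_emb + d_hid, d_emb)) * 0.01, np.zeros(d_emb)
e_r = fc(np.concatenate([e_s, h_fused, e_o], axis=1), w_f2, b_f2, relu=False)

print(e_s.shape, e_r.shape, e_o.shape)  # (1, 300) each
```

The key point is that subject/object embeddings are computed without any relation features, while the relation embedding consumes both hidden features and the two embeddings.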

3.2 Semantic Module

On the semantic side, we feed the word vectors of the subject, relation and object labels into a small MLP of one or two layers which outputs the embeddings. As in the visual module, the subject and object branches share weights while the relation branch is independent. The purpose of this module is to map word vectors into an embedding space that is more discriminative than the raw word vector space while preserving semantic similarity. During training, we feed the ground-truth labels of each relationship triplet as well as labels of negative classes into the semantic module, as the following subsection describes; during testing, we feed the whole sets of object and relation labels into it and perform nearest neighbor search among all labels, taking the top matches as our prediction.
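The test-time nearest neighbor search can be sketched in a few lines. The embedding dimension and label count below are illustrative assumptions; the point is that prediction is a cosine-similarity ranking over all label embeddings, linear in the vocabulary size:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def predict_topk(visual_emb, label_embs, k=5):
    """Rank all label embeddings by cosine similarity to one visual embedding
    and return the indices of the top-k labels."""
    sims = l2_normalize(label_embs) @ l2_normalize(visual_emb)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
label_embs = rng.normal(size=(1000, 300))           # e.g., all relation labels
query = label_embs[42] + 0.01 * rng.normal(size=300)  # a query near label 42
assert predict_topk(query, label_embs, k=1)[0] == 42
```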

A good word vector representation for object/relation labels is critical since it provides proper initialization that is easy to fine-tune on. We consider the following word vectors:

Pre-trained word2vec embeddings (wiki). We rely on the pre-trained word embeddings provided by [23], which are widely used in prior work. We use this embedding as a baseline, and show later that by combining it with other embeddings we achieve better discriminative ability.

Relationship-level co-occurrence embeddings (relco). This embedding encodes co-occurrence at the relationship level. By treating each relationship triplet seen in the training set as the “sentence” or “document” of a word2vec setting, each relation has as context the subjects and objects it usually appears with. Similarly, relations that tend to appear with a subject or object are used as context when learning their embeddings. We train a skip-gram word2vec model that tries to maximize classification of a word based on another word in the same context. Since we define context via our training set’s relationships, we effectively learn to maximize the likelihoods p(relation | subject) and p(relation | object), as well as p(subject | relation) and p(object | relation). Although maximizing the relationship likelihood is directly optimized in [34], we achieve similar results by reducing it to a skip-gram model and enjoy the scalability of a word2vec approach.
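The triplet-as-sentence idea reduces to generating standard skip-gram (center, context) pairs from each training triplet. This is a minimal pure-Python sketch of that pair generation (in practice one would hand the "sentences" to a word2vec implementation such as gensim's); the window size of 2 is an assumption that makes every word in a triplet context for every other:

```python
def skipgram_pairs(triplets, window=2):
    """Treat each <subject, relation, object> triplet as a word2vec 'sentence'.

    With window 2, every word in a triplet is context for every other word,
    so each relation sees the subjects/objects it co-occurs with, and vice versa.
    """
    pairs = []
    for sent in triplets:
        for i, center in enumerate(sent):
            for j, context in enumerate(sent):
                if i != j and abs(i - j) <= window:
                    pairs.append((center, context))
    return pairs

corpus = [("man", "wearing", "glasses"), ("dog", "next_to", "woman")]
pairs = skipgram_pairs(corpus)
# The relation gets its own subject and object as context:
assert ("wearing", "man") in pairs and ("wearing", "glasses") in pairs
```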

Node2vec embeddings (node2vec). As the Visual Genome dataset further provides image-level relation graphs, we also experimented with training node2vec embeddings as in [13]. These are effectively also word2vec embeddings, but the context is determined by random walks on a graph. In this setting, nodes correspond to subjects, objects and relations from the training set, and edges are directed from subject to relation and from relation to object for every image-level graph. This embedding can be seen as an intermediate between image-level and relationship-level co-occurrences, with proximity to one or the other controlled by the length of the random walks.
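The graph-to-sentence step can be illustrated with uniform random walks over the directed subject→relation→object edges (plain first-order walks; real node2vec biases the transition probabilities with its p/q parameters, which this sketch omits):

```python
import random

def random_walks(edges, walk_len=4, walks_per_node=2, seed=0):
    """Generate uniform random walks over a directed relationship graph.
    node2vec then trains word2vec on these walks as 'sentences'."""
    rng = random.Random(seed)
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    walks = []
    for node in adj:
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_len and walk[-1] in adj:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

# Edges run subject -> relation -> object within an image-level graph.
edges = [("man", "wearing"), ("wearing", "glasses"),
         ("dog", "next_to"), ("next_to", "woman")]
walks = random_walks(edges)
assert walks[0][:3] == ["man", "wearing", "glasses"]
```

Short walks stay close to a single relationship triplet; longer walks mix co-occurrences across the whole image graph, which is the intermediate behavior described above.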

3.3 Training Loss

To learn the joint visual and semantic embedding we employ a modified triplet loss. The traditional triplet loss [16] encourages matched embeddings from the two modalities to be closer than mismatched ones by a fixed margin, while our version tries to maximize this margin in a softmax form. In this subsection we review the traditional triplet loss and then introduce our triplet-softmax loss in a comparable fashion. To this end, for each positive visual-semantic pair (v, s) we form two sets of triplets: one whose negatives are drawn from the visual space, {(v⁻, s)}, and one whose negatives are drawn from the semantic space, {(v, s⁻)}.

Triplet loss. Omitting the branch superscripts for clarity, the triplet loss for each branch is the summation of two losses, one over semantic negatives and one over visual negatives. The semantic-negative term is

L_t = (1 / (N·K)) Σ_i Σ_k max(0, m − s(v_i, s_i) + s(v_i, s_ik⁻)),

where N is the number of positive ROIs, K is the number of negative samples per positive ROI, m is the margin between the distances of positive and negative pairs, and s(·,·) is a similarity function; the visual-negative term is symmetric, with s(v_ik⁻, s_i) in place of s(v_i, s_ik⁻).

We can observe from the equation above that as long as the similarity between a positive pair exceeds that of the negative pairs by m, the hinge term is zero and the loss contributes nothing for that triplet. That means, once the margin is pushed beyond m during training, the model stops learning anything from that triplet. It is therefore highly likely to end up with an embedding space whose points are not discriminative enough for a classification-oriented task.
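The saturation behavior described above is easy to see numerically. This sketch evaluates a margin-based triplet loss on raw similarity values (margin 0.2 chosen for illustration):

```python
import numpy as np

def triplet_loss(sim_pos, sim_negs, margin=0.2):
    """Margin-based triplet loss on similarities: zero (with zero gradient)
    once the positive beats every negative by at least the margin."""
    return np.maximum(0.0, margin - sim_pos + sim_negs).mean()

# Once sim_pos exceeds sim_neg by more than the margin, the loss and its
# gradient are exactly zero, so this triplet stops contributing to training.
assert triplet_loss(0.9, np.array([0.1])) == 0.0
# Inside the margin, the triplet still produces a learning signal.
assert triplet_loss(0.5, np.array([0.4])) > 0.0
```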

Triplet-Softmax loss. The issue of the triplet loss mentioned above can be alleviated by applying a softmax on top of each triplet:

L_ts = −(1/N) Σ_i log [ exp(s(v_i, s_i)) / (exp(s(v_i, s_i)) + Σ_k exp(s(v_i, s_ik⁻))) ],

where s(·,·) is the same similarity function (we use cosine similarity in this paper) and all other notation is as above. For each positive pair and its corresponding set of negative pairs, we compute the similarities of all pairs and feed them into a softmax layer followed by a multi-class logistic loss, so that the similarity of the positive pair is pushed towards 1 and those of the negative pairs towards 0. Compared to the triplet loss, this loss always tries to enlarge the margin to its largest possible value (i.e., 1), and thus has more discriminative power than the traditional triplet loss.
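The contrast with the plain triplet loss can be shown directly: a triplet that is already past a fixed margin still produces a nonzero triplet-softmax loss, so training keeps enlarging the margin. A minimal sketch:

```python
import numpy as np

def triplet_softmax_loss(sim_pos, sim_negs):
    """Softmax over [positive, negatives] similarities followed by a
    multi-class logistic loss on the positive slot. The gradient only
    vanishes as the positive probability approaches 1."""
    logits = np.concatenate([[sim_pos], sim_negs])
    logits = logits - logits.max()          # numerical stability
    p_pos = np.exp(logits[0]) / np.exp(logits).sum()
    return -np.log(p_pos)

# A triplet already past a 0.2 margin (where the plain triplet loss would be
# exactly zero) still contributes a nonzero loss here:
assert triplet_softmax_loss(0.9, np.array([0.1])) > 0.0
```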

It is worth noting that although in theory the traditional triplet loss can push the margin arbitrarily far if m is set large enough, most previous works (e.g., [16, 32, 10]) adopted a small m to allow slackness during training, and it is unclear how to determine the exact value of m for a specific task. We follow previous works and use a small fixed m in all of our experiments.

Visual Consistency loss. To further force the embeddings to be more discriminative, we add a loss that pulls together samples from the same category while pushing away those from different categories:

L_c = (1/N) Σ_i max(0, m − min_{j∈P(i)} s(v_i, v_j) + max_k s(v_i, v_ik⁻)),

where N is the number of positive ROIs, P(i) is the set of positive ROIs in the same class as i, the negatives v_ik⁻ are the K negative samples per positive ROI, and m is the margin between the distances of positive and negative pairs. The interpretation of this loss is: the minimum similarity between samples from the same class should be larger than any similarity between samples from different classes by a margin. Here we use the traditional triplet loss form, since we want slackness between visual embeddings to prevent them from collapsing to the class centers.
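A sketch of this hinge, operating on precomputed similarities for one anchor ROI (the exact formulation, in particular the min over same-class similarities, is a plausible reconstruction of the garbled equation, consistent with the interpretation given in the text):

```python
import numpy as np

def visual_consistency_loss(sims_same, sims_diff, margin=0.2):
    """Ask the *minimum* same-class similarity to exceed every
    different-class similarity by at least the margin."""
    return np.maximum(0.0, margin - sims_same.min() + sims_diff).mean()

sims_same = np.array([0.8, 0.9])   # similarities to same-class visual embeddings
sims_diff = np.array([0.3, 0.1])   # similarities to other-class embeddings
# 0.8 beats both 0.3 and 0.1 by more than 0.2, so the loss is satisfied:
assert visual_consistency_loss(sims_same, sims_diff) == 0.0
# Shrink the worst same-class similarity and a signal reappears:
assert visual_consistency_loss(np.array([0.4, 0.9]), sims_diff) > 0.0
```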

Empirically we found it best to use the triplet-softmax loss for the term with semantic negatives while using the triplet loss for the term with visual negatives. The reason is similar to that of the visual consistency loss: mode collapse among visual embeddings should be prevented by introducing slackness. On the semantic side there is no such issue, since each label is a mode by itself, and we encourage all label modes to be separated from each other. In conclusion, our final loss is the sum of the subject, relation and object losses plus a weighted visual consistency term, and we found that a single fixed weight works reasonably well for all scenarios.

Connection between Softmax and Triplet-Softmax. Our triplet-softmax module can be interpreted as a variant of the regular softmax. Specifically, for softmax we have:

p(y_j | x) = exp(xᵀ y_j) / Σ_c exp(xᵀ y_c),

where x is the input feature to the last layer, Y is the learned weight matrix of that layer, and y_j is the j-th column of Y. For triplet-softmax we have:

p(y_j | x) = exp(x̂ᵀ ŝ_j) / Σ_c exp(x̂ᵀ ŝ_c),

where x̂ is the normalized version of x, w_j is the input word vector to the semantic module, and its output s_j, normalized by default to ŝ_j, replaces the learned column y_j. It is clear that triplet-softmax can be seen as a special softmax whose last-layer weights are provided by the semantic module’s output instead of being learned independently within the visual module. This structure, with semantic guidance before the softmax, is the very reason our model is both discriminative and similarity-preserving.
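The equivalence is mechanical: swap the learned last-layer weight matrix for the (normalized) semantic-module outputs. A small sketch, using an identity function as a stand-in semantic module (a simplifying assumption; the paper uses a small MLP):

```python
import numpy as np

def softmax_probs(x, Y):
    """Regular softmax: logits are dot products with the columns of Y,
    which are ordinarily *learned* parameters of the last layer."""
    logits = x @ Y
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()

def triplet_softmax_probs(x, word_vecs, semantic_module):
    """Triplet-softmax: the 'last-layer weights' are the normalized outputs
    of the semantic module applied to the label word vectors."""
    S = semantic_module(word_vecs)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    x_hat = x / np.linalg.norm(x)
    return softmax_probs(x_hat, S.T)

rng = np.random.default_rng(0)
x = rng.normal(size=300)
word_vecs = rng.normal(size=(10, 300))       # word vectors for 10 labels
identity = lambda w: w                       # stand-in semantic module
p = triplet_softmax_probs(x, word_vecs, identity)
assert p.shape == (10,) and np.isclose(p.sum(), 1.0)
```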

4 Experiments

Datasets We present experiments on two datasets, the Visual Genome (VG) [17] and Visual Relationship Detection (VRD) dataset [22].

  • VRD. The VRD dataset [22] contains 5,000 images with 100 object categories and 70 relations. In total, VRD contains 37,993 relationship annotations with 6,672 unique triplets and 24.25 relationships per object category. We follow the same train/test split as [22] to get 4,000 training images and 1,000 test images. We use this dataset to demonstrate that our model works reasonably well on a small dataset with a small category space, even though it is designed for large-scale settings.

  • VG80k. We use the latest release of the Visual Genome dataset (VG v1.4) [17], which contains images densely annotated with relationships. Each relationship is of the form <subject, relation, object> with annotated subject and object bounding boxes. We follow [15] to split the data into training and testing images. Since the text annotations of VG are noisy, we first clean them by removing non-alphabet characters and stop words, and use the autocorrect library to correct spelling. Following that, we check whether all words in an annotation exist in the word2vec dictionary [23] and remove those that don’t. If an image ends up with all of its annotations removed, we delete it from the corresponding image set. Furthermore, we apply two rules to remove obvious annotation noise: 1) merge: we remove duplicate consecutive words, such as “horse horse” and “banana banana”, since they are clearly noise; we also merge phrases with the same set of words and the same meaning, such as “man smiling” and “smiling man”, selecting the variant with more occurrences in the training set and replacing the other with it. 2) join: we treat phrases with the same letters in the same order as one phrase, such as “surfboard” and “surf board”; again, we select the one that occurs more often and replace the other with it. We run this cleaning process on both the training and testing sets, ending up with 53,000+ object categories and 29,000+ relation categories. We further hold out part of the training set for validation. (We will release the cleaned annotations along with our code.)
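The two noise-removal rules above are simple string operations. This sketch implements the consecutive-duplicate rule and the same-word-set merge; the "join" rule (same letters ignoring spaces) is analogous and omitted for brevity:

```python
from collections import Counter

def dedupe_consecutive(phrase):
    """'merge' rule, part 1: drop duplicate consecutive words ('horse horse')."""
    out = []
    for w in phrase.split():
        if not out or out[-1] != w:
            out.append(w)
    return " ".join(out)

def canonicalize(phrases):
    """'merge' rule, part 2: map phrases with the same multiset of words
    ('man smiling' / 'smiling man') to the most frequent variant."""
    counts = Counter(phrases)
    best = {}
    for p in phrases:
        key = " ".join(sorted(p.split()))
        if key not in best or counts[p] > counts[best[key]]:
            best[key] = p
    return {p: best[" ".join(sorted(p.split()))] for p in phrases}

assert dedupe_consecutive("horse horse") == "horse"
mapping = canonicalize(["man smiling", "smiling man", "man smiling"])
# The rarer variant is rewritten to the more frequent one:
assert mapping["smiling man"] == "man smiling"
```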

Evaluation protocol. We evaluate all methods over the whole object and relation category sets. We use ground-truth boxes as relationship proposals, meaning there are no localization errors and the results directly reflect the recognition ability of a model. We use the following metrics to measure performance: (1) top-1, top-5, and top-10 accuracy; (2) mean reciprocal rank (rr), the average of 1/rank of the correct label over all test instances; (3) mean rank (mr), the average rank of the correct label (smaller is better).
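Given the 1-based rank of the ground-truth label for each test instance, all five metrics reduce to a few lines:

```python
import numpy as np

def ranking_metrics(rankings, ks=(1, 5, 10)):
    """rankings[i] is the 1-based rank of the ground-truth label for sample i.
    Returns top-k accuracies, mean reciprocal rank (rr) and mean rank (mr)."""
    r = np.asarray(rankings, dtype=float)
    topk = {k: float((r <= k).mean()) for k in ks}
    rr = float((1.0 / r).mean())
    mr = float(r.mean())
    return topk, rr, mr

topk, rr, mr = ranking_metrics([1, 3, 12])
assert topk[1] == 1 / 3 and topk[5] == 2 / 3 and topk[10] == 2 / 3
assert abs(rr - (1 + 1 / 3 + 1 / 12) / 3) < 1e-9 and mr == 16 / 3
```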

For VRD, we use the same evaluation metrics as [34], which reports recall rates using the top 50 and 100 relationship predictions, with k relations per relationship proposal before taking the top 50 and 100 predictions.

Implementation details. For both the VG80k and VRD datasets, we train our model using 8 GPUs with a step learning-rate schedule (a larger rate for the initial epochs and a smaller one for the remaining ones). We initialize the convolution layers and the subject/object branch with weights pre-trained on COCO [21]. The subject and object branches share weights, while the relation branch is initialized randomly, since knowledge of object recognition from COCO is not transferable to the branch that predicts relations. For the word vectors, we use the gensim library [29] for both word2vec and node2vec [13]. For the triplet loss, we use the same margin m in all experiments.

There is a critical factor that significantly affects our triplet-softmax loss. Since we use cosine similarity, s(·,·) is equivalent to the dot product of two normalized vectors, as shown in Section 3.3. We empirically found that simply feeding the normalized vectors can cause a vanishing gradient problem, since gradients are divided by the norm of the input vector when back-propagated. This is also observed in Bell et al. [2], where it was necessary to scale up normalized vectors for successful learning. Similar to [2], we set the scalar to a value close to the mean norm of the input vectors and multiply it in before the softmax layer, using one fixed scalar value for VG80k and another for VRD in all experiments.
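The effect of the scalar is easy to demonstrate: cosine logits live in [-1, 1], so an unscaled softmax over them is nearly uniform and back-propagates tiny gradients, while scaling restores a usable dynamic range. The scale value of 8.0 below is illustrative only, not the value used in the paper:

```python
import numpy as np

def scaled_cosine_logits(x, S, scale=8.0):
    """Cosine similarities between a feature x and label embeddings S,
    multiplied by a scalar (as in Bell et al.) before the softmax."""
    x_hat = x / np.linalg.norm(x)
    S_hat = S / np.linalg.norm(S, axis=1, keepdims=True)
    return scale * (S_hat @ x_hat)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x, S = rng.normal(size=64), rng.normal(size=(100, 64))
p_unscaled = softmax(scaled_cosine_logits(x, S, scale=1.0))
p_scaled = softmax(scaled_cosine_logits(x, S, scale=8.0))
# Scaling sharpens the distribution, giving the loss a stronger gradient.
assert p_scaled.max() > p_unscaled.max()
```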

ROI Sampling. One of the critical components that powers Fast-RCNN is its well-designed ROI sampling during training: for most ground-truth boxes, it provides both positive and negative ROIs, where positivity is defined by an IoU overlap threshold. In our setting, ROI sampling is the same for the subject/object branch, while for the relation branch positivity requires both the subject and object IoUs to exceed the threshold. Accordingly, we sample subject ROIs containing unique positives and unique negatives, and do the same for object ROIs. We then pair all subject ROIs with all object ROIs to obtain ROI pairs as relationship candidates. For each candidate, if both of its ROIs pass the IoU threshold we mark it as positive, otherwise negative. We finally sample a fixed number of positive and negative relation candidates and use the union box of each ROI pair as the relation ROI. In this way we end up with consistent numbers of positive and negative ROIs for the relation branch as for the subject and object branches.
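The pairing-and-labeling step can be sketched as follows; the IoU values and the 0.5 threshold are illustrative assumptions:

```python
from itertools import product

def make_relation_candidates(subj_rois, obj_rois, iou, thresh=0.5):
    """Pair every sampled subject ROI with every object ROI; a pair is a
    positive relation candidate only if BOTH ROIs overlap their ground
    truth with IoU >= thresh. `iou` maps an ROI id to its best-GT IoU."""
    pos, neg = [], []
    for s, o in product(subj_rois, obj_rois):
        (pos if iou[s] >= thresh and iou[o] >= thresh else neg).append((s, o))
    return pos, neg

iou = {"s1": 0.9, "s2": 0.1, "o1": 0.8, "o2": 0.2}
pos, neg = make_relation_candidates(["s1", "s2"], ["o1", "o2"], iou)
# Only the pair where both boxes are well-localized is positive:
assert pos == [("s1", "o1")] and len(neg) == 3
```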

Recall@50 / Recall@100 for relationship and phrase detection on VRD:

Method               | Rel. (free k) | Phr. (free k) | Rel. k=1    | Rel. k=10   | Rel. k=70   | Phr. k=1    | Phr. k=10   | Phr. k=70

w/ proposals from [22]:
CAI* [39]            | 15.63/17.39   | 17.60/19.24   | -           | -           | -           | -           | -           | -
Language cues [28]   | 16.89/20.70   | 15.08/18.37   | -           | 16.89/20.70 | -           | -           | 15.08/18.37 | -
VRD [22]             | 17.43/22.03   | 20.42/25.52   | 13.80/14.70 | 17.43/22.03 | 17.35/21.51 | 16.17/17.03 | 20.42/25.52 | 20.04/24.90
Ours                 | 19.18/22.64   | 21.69/25.92   | 16.08/17.07 | 19.18/22.64 | 18.89/22.35 | 18.32/19.78 | 21.69/25.92 | 21.39/25.65

w/ better proposals:
DR-Net* [4]          | 17.73/20.88   | 19.93/23.45   | -           | -           | -           | -           | -           | -
ViP-CNN [19]         | 17.32/20.01   | 22.78/27.91   | 17.32/20.01 | -           | -           | 22.78/27.91 | -           | -
VRL [20]             | 18.19/20.79   | 21.37/22.60   | 18.19/20.79 | -           | -           | 21.37/22.60 | -           | -
PPRFCN* [36]         | 14.41/15.72   | 19.62/23.75   | -           | -           | -           | -           | -           | -
VTransE*             | 14.07/15.20   | 19.42/22.42   | -           | -           | -           | -           | -           | -
SA-Full* [26]        | 15.80/17.10   | 17.90/19.50   | -           | -           | -           | -           | -           | -
CAI* [39]            | 20.14/23.39   | 23.88/25.26   | -           | -           | -           | -           | -           | -
KL distillation [34] | 20.12/28.94   | 22.59/25.54   | 16.57/17.69 | 19.92/27.98 | 20.12/28.94 | 19.15/19.98 | 22.95/25.16 | 22.59/25.54
Ours                 | 20.02/24.27   | 23.61/28.97   | 18.62/20.36 | 20.02/24.27 | 20.01/24.26 | 22.17/24.48 | 23.61/28.96 | 23.59/28.97

Table 1: Results on the VRD [22] dataset (“-” means unavailable / unknown).

4.1 Comparison on VRD

We first validate our model on the small VRD dataset against state-of-the-art methods, using the metrics of [34], in Table 1. Note that this metric involves a variable k, the number of relation candidates per relationship proposal when selecting the top 50/100. Since not all previous methods specify k in their evaluation, we first report performance in the “free k” column, treating k as a hyper-parameter that can be cross-validated: for methods where performance is reported for one or more values of k, this column shows the result for the best k. We then list all available results for specific values of k in the remaining columns.

Another factor that is inconsistent among related works is the object proposals used. The only consistently known setting is when approaches use the publicly available proposals provided by [22], as in [22, 28, 39]. Other methods either train an offline Fast/Faster-RCNN detector on VRD to extract boxes [4, 20, 36, 35, 27, 39, 34], or use built-in RPN modules to generate proposals on the fly [19].

For fairness, we split the table into two parts. The top part lists methods that use the same proposals as [22], while the bottom part lists methods based on a different set of proposals; ours uses better proposals obtained from Faster-RCNN, as in previous works. We outperform all other methods using the proposals from [22], even without message-passing-like post-processing as in [19, 4], and are also very competitive with the overall best performing method [34]. Note that although spatial features could be advantageous on VRD according to previous methods, we do not use them in our model, out of concern for large-scale settings, as explained in Section 3.1. We would expect better VRD performance from integrating spatial features, but for model consistency we run all experiments without them.

Columns per group: top1 / top5 / top10 / rr / mr.

Method             | Relationship Triplet             | Relation

All classes:
3-branch Fast-RCNN | 9.73  41.95 55.19 52.10 16.36    | 36.00 69.59 79.83 50.77 7.81
ours w/ triplet    | 8.01  27.06 35.27 40.33 32.10    | 37.98 61.34 69.60 48.28 14.12
ours w/ softmax    | 14.53 46.33 57.30 55.61 16.94    | 49.83 76.06 82.20 61.60 8.21
ours final         | 15.72 48.83 59.87 57.53 15.08    | 52.00 79.37 85.60 64.12 6.21

Tail classes:
3-branch Fast-RCNN | 0.32  3.24  7.69  24.56 49.12    | 0.91  4.36  9.77  4.09  52.19
ours w/ triplet    | 0.02  0.29  0.58  7.73  83.75    | 0.12  0.61  1.10  0.68  86.60
ours w/ softmax    | 0.00  0.07  0.47  20.36 58.50    | 0.00  0.08  0.55  1.11  65.02
ours final         | 0.48  13.33 28.12 43.26 5.48     | 0.96  7.61  16.36 5.56  45.70

Table 2: Results on all relation classes and on tail classes (those occurring no more than 1024 times) in VG80k. Note that since VG80k is extremely imbalanced, classes with no more than 1024 occurrences are still in the tail: they cover more than 99% of the relation classes but only 10.04% of the instances.
(a) Top 5 rel triplet
(b) Top 5 relation
Figure 3: Top-5 relative accuracies against the 3-branch Fast-RCNN baseline in the tail intervals. The intervals are defined as bins of 32 from 1 to 1024 occurrences of the relation classes.

4.2 Relationship Recognition on VG80k

Baselines. Since no previous method works in our large-scale setting, we carefully design three baselines to compare with: 1) 3-branch Fast-RCNN: an intuitively straightforward model, i.e., a Fast-RCNN with a shared convolutional backbone and three branches for subject, relation and object respectively, where the subject and object branches share weights since they are essentially an object detector; 2) our model with softmax loss: we replace our loss with the softmax loss; 3) our model with triplet loss: we replace our loss with the triplet loss.

Results. As shown in Table 2, our loss is the best in the general case, where all instances from all classes are considered. The baseline has reasonable performance but is clearly worse than ours with softmax, demonstrating that our visual module is critical for efficient learning. Ours with the triplet loss is worse than ours with softmax in the general case, since the triplet loss is not discriminative enough on such massive data. The opposite holds for the tail classes, since recognition of infrequent classes benefits from knowledge transferred from frequent classes, which the softmax-based model is not capable of. Another observation is that although the 3-branch Fast-RCNN baseline works poorly in the general case, it beats our model with softmax on the tail. Since the main difference between them is the visual feature concatenation, this means that integrating subject and object features does not necessarily help infrequent relation classes: subject and object features impose a strong constraint on the relation, resulting in a lower chance of predicting an infrequent relation when using softmax. For example, for a rare relationship such as “dog ride horse”, the subject being “dog” and the object being “horse” would assign very little probability to the relation “ride”, even though it is the correct answer. Our model alleviates this problem by mapping visual features not directly to the discrete categorical space, but to a continuous embedding space where visual similarity is preserved. Therefore, when seeing the visual features of “dog”, “horse” and the whole “dog ride horse” context, our model is able to associate them with the visually similar relationship “person ride horse” and correctly output the relation “ride”.

4.3 Ablation Study

4.3.1 Variants of our model

                    Relationship Triplet                  Relation
Methods             top1   top5   top10  rr     mr        top1   top5   top10  rr     mr
wiki                15.59  46.03  54.78  52.45  25.31     51.96  78.56  84.38  63.61   8.61
relco               15.58  46.63  55.91  54.03  22.23     52.00  79.06  84.75  63.90   7.74
wiki + relco        15.72  48.83  59.87  57.53  15.08     52.00  79.37  85.60  64.12   6.21
wiki + node2vec     15.62  47.58  57.48  54.75  20.93     51.92  78.83  85.01  63.86   7.64
0 sem layers        11.21  28.78  34.84  38.64  43.49     44.66  60.06  64.74  51.60  24.74
1 sem layer         15.75  48.23  58.28  55.70  19.15     51.82  78.94  85.00  63.79   7.63
2 sem layers        15.72  48.83  59.87  57.53  15.08     52.00  79.37  85.60  64.12   6.21
3 sem layers        15.49  48.42  58.75  56.98  15.83     52.00  79.19  85.08  63.99   6.40
no concat           10.47  42.51  54.51  51.51  20.16     36.96  70.44  80.01  51.62   9.26
early concat        15.09  45.88  55.72  54.72  19.69     49.54  75.56  81.49  61.25   8.82
late concat         15.57  47.72  58.05  55.34  19.27     51.06  78.15  84.47  63.03   7.90
both concat         15.72  48.83  59.87  57.53  20.62     52.00  79.37  85.60  64.12   6.21
one loss term       15.21  47.28  57.77  55.06  19.12     50.67  78.21  84.70  62.82   7.31
+ second term       15.07  47.37  57.85  54.92  19.59     50.60  78.06  84.40  62.71   7.60
+ third term        15.53  47.97  58.49  55.78  18.55     51.48  78.99  84.90  63.59   7.32
all three terms     15.72  48.83  59.87  57.53  15.08     52.00  79.37  85.60  64.12   6.21

Table 3: Ablation study of our model on VG80k.

We explore variants of our model along 4 dimensions: 1) the semantic embeddings fed to the semantic module; 2) the structure of the semantic module; 3) the structure of the visual module; 4) the losses. The default settings are 1) wiki + relco embeddings; 2) 2 semantic layers; 3) both visual concatenations; 4) all three loss terms. When exploring one dimension, we fix the other three to their default settings.

Which semantic embedding to use? We explore 4 settings: 1) wiki and 2) relco use the Wikipedia and relationship-level co-occurrence embeddings alone, while 3) wiki + relco and 4) wiki + node2vec use the concatenation of two embeddings. The intuition behind concatenating wiki with relco or node2vec is that wiki contains common knowledge acquired outside of the dataset, while relco and node2vec are trained specifically on VG80k; their combination provides abundant information to initialize the semantic module. As shown in Table 3, the fusion of wiki and relco outperforms each one alone by clear margins. We found that using node2vec alone does not perform well, but wiki + node2vec is competitive with the others, demonstrating the efficacy of concatenation.
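As a minimal sketch, the concatenation amounts to joining per-token vectors from two embedding tables; the tiny dict-of-arrays tables below are hypothetical stand-ins for the wiki and relco embeddings:

```python
import numpy as np

def concat_embedding(word, wiki, relco):
    """Concatenate two word-embedding tables for one token.
    The tables here are toy stand-ins, not the real embeddings."""
    return np.concatenate([wiki[word], relco[word]])

# Toy tables: 3-d "wiki" vectors, 2-d "relco" vectors.
wiki = {"ride": np.array([0.1, 0.2, 0.3])}
relco = {"ride": np.array([0.7, 0.8])}

v = concat_embedding("ride", wiki, relco)
assert v.shape == (5,)  # 3 + 2 dimensions after concatenation
```

The concatenated vector is what the semantic module receives as input.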

Number of semantic layers. We also study how many layers, if any, are necessary to embed the word vectors. As shown in Table 3, directly using the word vectors (0 semantic layers) is not a good substitute for our learned embedding; raw word vectors are trained to capture as many associations between words as possible, not to distinguish them. We find that either 1 or 2 layers gives similarly good results, with 2 layers slightly better, while performance starts to degrade when adding more layers.
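The semantic module itself can be sketched as a small MLP over the raw word vector; the layer sizes below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def semantic_module(w, params):
    """Two-layer semantic module (a sketch): maps a raw word vector
    into the shared visual-semantic embedding space."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, w @ W1 + b1)   # layer 1 + ReLU
    return h @ W2 + b2                 # layer 2, no final activation

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 600, 1024, 1024   # hypothetical sizes
params = (rng.standard_normal((d_in, d_hid)) * 0.01, np.zeros(d_hid),
          rng.standard_normal((d_hid, d_out)) * 0.01, np.zeros(d_out))

z = semantic_module(rng.standard_normal(d_in), params)
assert z.shape == (d_out,)
```

With 0 layers the raw word vector is used directly; adding a third layer would simply stack another weight-bias pair in `params`.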

Are both visual feature concatenations necessary? In Table 3, "early concat" means using only the first concatenation of the three branches, and "late concat" means only the second. Both early and late concatenation boost performance significantly compared to no concatenation, and using both is best. Another observation is that late concatenation alone is better than early alone. We believe the reason is that, as mentioned in Section 4.2, relations are naturally conditioned on and constrained by subjects and objects. Since late concatenation happens at a higher level, it integrates features that are semantically closer to the subject and object labels, which gives a stronger prior to the relation branch and affects relation prediction more than the early concatenation.
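A rough sketch of the two concatenation points, with toy one-layer branches standing in for the actual networks (all sizes and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def branch(x, w):
    # One-layer stand-in for a full branch network.
    return np.maximum(0.0, x @ w)

d = 8                                         # toy feature size
ws, wo = rng.standard_normal((d, d)), rng.standard_normal((d, d))
w_early = rng.standard_normal((3 * d, d))     # fuses raw ROI features
w_late = rng.standard_normal((3 * d, d))      # fuses high-level outputs

fs, fo, fr = (rng.standard_normal(d) for _ in range(3))  # ROI features

# Early concatenation: fuse subject/object/relation ROI features first.
early = branch(np.concatenate([fs, fo, fr]), w_early)
# Subject and object branches still process their own features.
hs, ho = branch(fs, ws), branch(fo, wo)
# Late concatenation: fuse the high-level subject/object outputs back
# into the relation branch before its final embedding.
late = branch(np.concatenate([hs, ho, early]), w_late)
assert late.shape == (d,)
```

Ablating "early" or "late" corresponds to dropping one of the two `np.concatenate` fusion points.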

Do all the losses help? To understand how each loss term helps training, we trained 3 models, each of which excludes one or two loss terms. We can see that adding either of the two auxiliary terms alone gives similar results, and using all three losses is best. This is because the contrastive term pulls matched visual-semantic pairs close while pushing unmatched ones away. However, since the mapping from visual features to labels is many-to-one (i.e., multiple visual features can have the same label), there is no guarantee that visual features with the same label will be embedded closely without the third term. By introducing it, visual features sharing a label are forced to be close to each other, and thus the structural consistency of visual features is preserved.
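A hedged sketch of such an objective: a softmax contrastive term over visual-semantic similarities plus a consistency term pulling same-label visual embeddings together (the exact form and weighting in the paper may differ):

```python
import numpy as np

def softmax_xent(sim, target):
    """Cross-entropy over one row of similarities (softmax-contrastive)."""
    z = sim - sim.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def total_loss(v, s, labels):
    """Sketch of a two-part objective (an assumption, not the exact loss):
    contrastive term + same-label visual consistency term."""
    sim = v @ s.T  # visual x semantic similarity matrix
    # Contrastive: each visual feature must pick out its own label.
    l_con = np.mean([softmax_xent(sim[i], labels[i]) for i in range(len(v))])
    # Consistency: visual features that share a label should stay close.
    l_cons, n = 0.0, 0
    for i in range(len(v)):
        for j in range(i + 1, len(v)):
            if labels[i] == labels[j]:
                l_cons += np.sum((v[i] - v[j]) ** 2)
                n += 1
    return l_con + l_cons / max(n, 1)

v = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # visual embeddings
s = np.array([[1.0, 0.0], [0.0, 1.0]])               # semantic embeddings
assert total_loss(v, s, labels=[0, 0, 1]) > 0
```

Dropping the consistency loop recovers the contrastive-only ablation discussed above.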

4.3.2 The margin m in triplet loss

We show results of triplet loss with various values of the margin m in Table 4. As described in Section 3.3 in the paper, this value allows slackness in pushing negative pairs away from positive ones. Consistent with previous works [16, 32, 10], we observe that a small margin (0.1 or 0.2) is best for optimal performance. It is clear that triplet loss is unable to learn discriminative embeddings suitable for classification tasks, even with a larger m that can theoretically enforce more contrast against negative labels. We believe the main reason is that, in its hinge-loss form, triplet loss treats all negative pairs as equally "hard" as long as they are within the margin m. However, as shown by the success of softmax models, "easy" negatives (e.g., those far from the positives) should be penalized less than "hard" ones (those close to the positives), a property our model has since we utilize softmax for contrastive training.
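For reference, the hinge-form triplet loss being ablated can be sketched as follows; note that every negative violating the margin contributes linearly, regardless of how close it is to the positive:

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, m=0.2):
    """Hinge-form triplet loss (a sketch of the ablated baseline).
    Any negative within margin m of the positive incurs a penalty;
    negatives outside the margin contribute exactly zero."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negatives) ** 2, axis=1)
    return np.sum(np.maximum(0.0, m + d_pos - d_neg))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])           # squared distance to anchor: 0.01
n = np.array([[1.0, 0.0],          # far negative: no violation
              [0.2, 0.0]])         # near negative: 0.2 + 0.01 - 0.04 = 0.17
assert abs(triplet_loss(a, p, n, m=0.2) - 0.17) < 1e-9
```

Under this loss the far negative is ignored entirely, whereas a softmax over similarities would still weight negatives by how confusable they are.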

        Relationship Triplet                  Relation
m       top1   top5   top10  rr     mr       top1   top5   top10  rr     mr
0.1     7.77   29.84  38.53  42.29  28.13    36.50  63.50  70.20  47.48  14.20
0.2     8.01   27.06  35.27  40.33  32.10    37.98  61.34  69.60  48.28  14.12
0.3     5.78   24.39  33.26  37.03  34.55    36.75  58.65  64.86  46.62  20.62
0.4     3.82   22.55  31.70  34.10  36.26    34.89  57.25  63.74  45.04  21.89
0.5     3.14   19.69  30.01  31.63  38.25    33.65  56.16  62.77  43.88  23.19
0.6     2.64   15.68  27.65  29.74  39.70    32.15  55.08  61.68  42.52  24.25
0.7     2.17   11.35  24.55  28.06  41.47    30.36  54.20  60.60  41.02  25.23
0.8     1.87    8.71  16.30  26.43  43.18    29.78  53.43  60.01  40.29  26.19
0.9     1.43    7.44  11.50  24.76  44.83    28.35  51.73  58.74  38.89  27.27
1.0     1.10    6.97  10.51  23.57  46.60    27.49  50.72  58.10  37.97  28.13

Table 4: Performance of triplet loss on VG80k with different values of the margin m. We use the best-performing margin for all our experiments in the main paper.

4.3.3 The scaling factor before softmax

As mentioned in the implementation details of Section 4 in the paper, this factor scales up the normalized output by a value close to the average norm of the input, preventing the gradient vanishing caused by the normalization; specifically, in Eq. (7) of the paper the normalized logits are multiplied by this scaling factor before the softmax. In Table 5 we show results of our model when changing the value of the scaling factor. We observe optimal performance when the value is close to the average norm of all input vectors (i.e., 5.0), although small deviations from this value (i.e., 4.0 or 6.0) do not change results much. It is clear that with a scaling factor of 1.0, which is equivalent to training without scaling, the model is not sufficiently trained. We therefore pick 5.0 as the scaling factor for all the other experiments on VG80k.
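The effect can be sketched as follows: after L2-normalization, cosine logits lie in [-1, 1], so an unscaled softmax over many classes is nearly uniform and its gradients are tiny; multiplying by a factor such as 5 sharpens the distribution (dimensions and weights below are toy values):

```python
import numpy as np

def scaled_softmax(x, w, gamma):
    """L2-normalize the embedding and class weights, then scale the
    cosine logits by gamma before softmax (a sketch of the trick)."""
    xn = x / np.linalg.norm(x)
    wn = w / np.linalg.norm(w, axis=1, keepdims=True)
    logits = gamma * (wn @ xn)        # cosine similarities in [-1, 1]
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
x, w = rng.standard_normal(16), rng.standard_normal((100, 16))
p1 = scaled_softmax(x, w, 1.0)   # near-uniform: weak learning signal
p5 = scaled_softmax(x, w, 5.0)   # sharper: stronger gradient on top class
assert p1.max() < p5.max()
```

Raising gamma monotonically concentrates probability mass on the highest-similarity class, which is why 1.0 undertrains while values near the average input norm work well.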

         Relationship Triplet                  Relation
scale    top1   top5   top10  rr     mr       top1   top5   top10  rr     mr
1.0      0.00    0.61   3.77  22.43  48.24     0.04   1.12   5.97   4.11  21.39
2.0      8.48   27.63  34.26  35.25  46.28    44.94  70.60  76.63  56.69  13.20
3.0     14.19   39.22  46.71  48.80  29.65    51.07  74.61  78.74  61.74  10.88
4.0     15.72   47.19  56.94  54.80  20.85    51.67  78.66  84.23  63.53   8.68
5.0     15.72   48.83  59.87  57.53  15.08    52.00  79.37  85.60  64.12   6.21
6.0     15.32   47.99  58.10  55.57  18.67    51.60  78.95  85.05  63.62   7.23
7.0     15.11   44.72  54.68  54.04  20.82    51.23  77.37  83.37  62.95   7.86
8.0     14.84   45.12  54.95  54.07  20.56    51.25  77.67  83.36  62.97   7.81
9.0     14.81   45.72  55.81  54.29  20.10    50.88  78.59  84.70  63.08   7.21
10.0    14.71   45.62  55.71  54.19  20.19    51.07  78.64  84.78  63.21   7.26

Table 5: Performance of our model on VG80k with different values of the scaling factor. We use a scaling factor of 5.0 for all our experiments on VG80k in the main paper.

4.3.4 Performance trend analysis

Similar to Figure 3 in the paper, in Figure 4 we show top1/top10 and reciprocal rank/mean rank performance as the relation class frequency increases on a log-linear scale. We compute reciprocal rank and mean rank based on the top 250 predictions from each model. This gives a wider view of each model by looking at the position at which it ranks the correct answer. It is clear that our model is superior to the baselines on the infrequent classes under the reciprocal/mean rank metrics.
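The ranking metrics can be sketched as follows; the fallback rank of k+1 for a ground truth missing from the top k is an assumption, since the exact convention is not stated:

```python
import numpy as np

def rr_mr(ranked_lists, gt, k=250):
    """Reciprocal rank and mean rank from each sample's top-k predictions
    (a sketch of the evaluation described above). Ranks are 1-based; a
    ground truth outside the top k is assigned rank k + 1 (assumption)."""
    rrs, mrs = [], []
    for preds, g in zip(ranked_lists, gt):
        preds = preds[:k]
        rank = preds.index(g) + 1 if g in preds else k + 1
        rrs.append(1.0 / rank)
        mrs.append(rank)
    return np.mean(rrs), np.mean(mrs)

# Sample 1: correct label 1 ranked 2nd; sample 2: label 2 not in top 3.
rr, mr = rr_mr([[3, 1, 2], [9, 9, 9]], gt=[1, 2], k=3)
assert abs(rr - (1/2 + 1/4) / 2) < 1e-9
assert mr == 3.0
```

Averaging over samples grouped by class frequency produces the per-bin curves plotted in Figure 4.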

Figure 4: (a)-(f) Top-1, 5, 10 accuracies and (g)-(j) Reciprocal Rank and Mean Rank of relationship triplets and relations as the relation class frequency increases on a log-linear scale.

4.3.5 Qualitative results

VG80k has densely annotated relationships of a wide range of types for most images. Figures 5 and 6 show interactive relationships such as "boy flying kite" and "batter holding bat", positional relationships such as "glass on table" and "man next to man", and attributive relationships such as "man in suit" and "boy has face". Our model covers all these kinds, frequent or infrequent, and even its incorrect predictions remain semantically meaningful and close to the ground truth, e.g., the ground-truth "lamp on pole" vs. the predicted "light on pole", and the ground-truth "motorcycle on sidewalk" vs. the predicted "scooter on sidewalk".

Figure 5: Qualitative results. Notations are the same as in Figure 1 in the paper. Our model recognizes a wide range of relationships; even when its predictions do not match the ground truth, they are frequently correct or at least reasonable, as the ground-truth annotations are not complete.
Figure 6: Qualitative results (continued). Notations are the same as in Figure 1 in the paper. Our model recognizes a wide range of relationships; even when its predictions do not match the ground truth, they are frequently correct or at least reasonable, as the ground-truth annotations are not complete.
Figure 7: Qualitative results (continued). Notations are the same as in Figure 1 in the paper. Our model recognizes a wide range of relationships; even when its predictions do not match the ground truth, they are frequently correct or at least reasonable, as the ground-truth annotations are not complete.

5 Conclusions

We propose a framework that scales to tens of thousands of visual entity categories. In our extensive large-scale evaluation, we find that integrating subject and object features at multiple levels is crucial for good relation embeddings. We further design a loss that learns to embed visual and semantic features into a shared space, where semantic correlations between categories are preserved without hurting discriminative ability. Experiments show that our model is superior to strong baselines and ablations for massive large-scale relationship detection. Future work includes introducing a relationship proposal module to complete the detection pipeline, and integrating it with the current framework to make the whole system end-to-end trainable.

6 Acknowledgements

This research is partially funded by a gift from Facebook AI Research and by the National Science Foundation (NSF) under NSF-USA award #1409683. We thank Laurens van der Maaten for his tremendous assistance and suggestions that greatly improved this work. We also thank Priya Goyal for sharing her code, on which our work is based.


  • [1] J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
  • [2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [3] F. Chollet. Information-theoretical label embeddings for large-scale image classification. arXiv preprint arXiv:1607.05691, 2016.
  • [4] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
  • [5] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In European conference on computer vision, pages 48–64. Springer, 2014.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
  • [7] M. Elhoseiny, S. Cohen, W. Chang, B. Price, and A. Elgammal. Sherlock: Scalable fact learning in images. In AAAI, 2017.
  • [8] M. Elhoseiny, A. Elgammal, and B. Saleh. Write a classifier: Predicting visual classifiers from unstructured text descriptions. TPAMI, 2016.
  • [9] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
  • [10] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. Vse++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.
  • [11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In NIPS, 2013.
  • [12] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2014.
  • [13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
  • [16] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
  • [17] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [19] Y. Li, W. Ouyang, and X. Wang. Vip-cnn: A visual phrase reasoning convolutional neural network for visual relationship detection. CVPR, 2017.
  • [20] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. arXiv preprint arXiv:1703.03054, 2017.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
  • [22] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
  • [23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [24] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
  • [25] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. EMNLP, 2014.
  • [26] J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. In ICCV, 2017.
  • [27] J. Peyre, I. Laptev, C. Schmid, and J. Sivic. Weakly-supervised learning of visual relations. ICCV, 2017.
  • [28] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
  • [29] R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
  • [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [31] R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, and A. Y. Ng. Zero shot learning through cross-modal transfer. In NIPS, 2013.
  • [32] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. Order-embeddings of images and language. ICLR, 2016.
  • [33] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [34] R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [35] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. arXiv preprint arXiv:1702.08319, 2017.
  • [36] H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4233–4241, 2017.
  • [37] J. Zhang, M. Elhoseiny, S. Cohen, W. Chang, and A. Elgammal. Relationship proposal networks. In CVPR, volume 1, page 2, 2017.
  • [38] H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba. Open vocabulary scene parsing. arXiv preprint arXiv:1703.08769, 2017.
  • [39] B. Zhuang, L. Liu, C. Shen, and I. Reid. Towards context-aware interaction recognition for visual relationship detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.