Recovering the Unbiased Scene Graphs from the Biased Ones

07/05/2021 · by Meng-Jiun Chiou, et al. · National University of Singapore, ByteDance Inc.

Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical representations describing visual relationships among salient objects. Recently, more effort has been devoted to the long tail problem in SGG; however, the imbalance in the fraction of missing labels across classes, or reporting bias, which exacerbates the long tail, is rarely considered and cannot be solved by the existing debiasing methods. In this paper we show that, due to the missing labels, SGG can be viewed as a "Learning from Positive and Unlabeled data" (PU learning) problem, where the reporting bias can be removed by recovering the unbiased probabilities from the biased ones using label frequencies, i.e., the per-class fraction of labeled, positive examples among all positive examples. To obtain accurate label frequency estimates, we propose Dynamic Label Frequency Estimation (DLFE), which takes advantage of training-time data augmentation and averages over multiple training iterations to introduce more valid examples. Extensive experiments show that DLFE is more effective in estimating label frequencies than a naive variant of the traditional estimator, and that DLFE significantly alleviates the long tail and achieves state-of-the-art debiasing performance on the VG dataset. We also show qualitatively that SGG models with DLFE produce prominently more balanced and unbiased scene graphs.


1. Introduction

Scene graph generation (SGG) (Lu et al., 2016) aims to predict visual relationships in the form of (subject-predicate-object) among salient objects in images. SGG has been shown to be helpful for image captioning (Yao et al., 2018; Yang et al., 2019; Li and Jiang, 2019), visual question answering (Teney et al., 2017; Shi et al., 2019) and indoor scene understanding (Armeni et al., 2019; Chiou et al., 2020), and has thus been drawing increasing attention (Zellers et al., 2018; Yang et al., 2018; Herzig et al., 2018; Chen et al., 2019b, c; Gu et al., 2019; Chen et al., 2019a; Dornadula et al., 2019; Zareian et al., 2020a; Khademi and Schulte, 2020; Tang et al., 2020; Lin et al., 2020; Wang et al., 2020a; Yan et al., 2020; Knyazev et al., 2020; He et al., 2020; Wang et al., 2020b; Zareian et al., 2020b; Sharifzadeh et al., 2020; Yu et al., 2020; Ren et al., 2021; Wei et al., 2020; Hung et al., 2020; Tian et al., 2020; Yuren et al., 2020; Chiou et al., 2021b, a).

Figure 1. An illustrative comparison between the traditional, biased inference and the unbiased PU inference for SGG. (a) Traditionally, SGG models are not trained in the PU setting and thus output biased probabilities in favor of conspicuous classes (e.g., on). (b) We remove the reporting bias from the biased probabilities by discounting the difference in the chance of being labeled, i.e., the label frequency, so that inconspicuous classes (e.g., parked on) are properly predicted.

The long tail problem is common and challenging in SGG (Tang et al., 2020): since certain predicates (i.e., head classes) occur far more frequently than others (i.e., tail classes) in the most widely-used VG dataset (Krishna et al., 2017), a model trained on this unbalanced dataset favors predicting head classes over tail classes. For instance, the number of training examples of on is about 830 times that of painted on in the VG dataset, and (given ground truth objects) a classical SGG model, MOTIFS (Zellers et al., 2018), achieves 74.3 Recall@20 for on, in sharp contrast to 0.0 for painted on. However, the fact that the head classes are less descriptive than the tail classes makes the generated scene graphs coarse-grained and less informative, which is not ideal.

Most of the existing efforts in long-tailed SGG (Chen et al., 2019c; Tang et al., 2020; Yan et al., 2020; Wang et al., 2020a; Wei et al., 2020; He et al., 2020) deal with the skewed class distribution directly. However, unlike common long-tailed classification tasks where the long tails are mostly caused by unbalanced class prior distributions, the long tail of SGG with the VG dataset is significantly affected by the imbalance of missing labels, which remains unsolved. The missing label problem arises because it is unrealistic to annotate the overwhelming number of possible visual relations (the number of candidates grows quadratically with the number of objects and linearly with the number of predicate classes in an image). Training SGG models by treating all unlabeled pairs as negative examples (the default setting for most existing SGG works) introduces missing label bias in predictions, i.e., predicted probabilities could be under-estimated. What is worse, reporting bias (Misra et al., 2016; Tang et al., 2020), which is prevalent in the VG dataset, causes an imbalance in the missing labels of different predicates. That is, the conspicuous classes (e.g., on, in) are more likely to be annotated than the inconspicuous ones (e.g., parked on, covered in). Generally, conspicuous classes are more extensively annotated and have higher label frequencies, i.e., the fraction of labeled, positive examples among all examples of an individual class. The unbalanced label frequency distribution means that the predicted probability of an inconspicuous class could be under-estimated more than that of a conspicuous one, causing a long tail. To produce meaningful scene graphs, the inconspicuous but informative predicates need to be properly predicted. To the best of our knowledge, none of the existing SGG debiasing methods (Chen et al., 2019c; Tang et al., 2020; Yan et al., 2020; Wang et al., 2020a; Wei et al., 2020; He et al., 2020) effectively solves this reporting bias problem.

In this paper, we propose to tackle the reporting bias problem by removing the effect of the unbalanced label frequency distribution. That is, we aim to recover the unbiased version of the per-class predicted probabilities such that they are independent of the per-class missing label bias. To do this, we first show that learning an SGG model on the VG dataset can be viewed as a Learning from Positive and Unlabeled data (PU learning) (Denis et al., 2005; Elkan and Noto, 2008; Bekker and Davis, 2020) problem, where a target PU dataset contains only positive examples and unlabeled data. For clarity, we define a biased model as one trained on a PU dataset by treating the unlabeled data as negatives, which outputs biased probabilities, and an unbiased model as one trained on a fully-annotated dataset, which outputs unbiased probabilities. Under the PU learning setting, the per-class unbiased probabilities are proportional to the biased ones with the per-class label frequencies as the proportionality constants (Elkan and Noto, 2008). Motivated by this fact, we propose to recover the unbiased visual relationship probabilities from the biased ones by dividing by the estimated per-class label frequencies so that the imbalance (i.e., reporting bias) can be offset. In particular, the inconspicuous predicates, whose probabilities are under-estimated more, can then be predicted with higher confidence so that the scene graphs become more informative. An illustrative comparison of the traditional, biased method and our unbiased one is shown in Fig. 1.

A traditional estimator of label frequencies is the per-class average of the biased probabilities predicted by a biased model on a training/validation set (Elkan and Noto, 2008). While this estimator works in the easier SGG settings where ground truth bounding boxes are given, i.e., PredCls and SGCls, it fails to provide estimates for some classes in the hardest SGG setting where no information other than images is provided, i.e., SGDet. The reason is that those classes have no valid examples (i.e., predicted object pairs that match ground truth boxes and object labels) that can be used for label frequency estimation. For instance, by forwarding a trained MOTIFS (Zellers et al., 2018) model on the VG training set, 9 out of 50 predicates do not have even a single valid example, making estimation impossible. In this paper, we propose to take advantage of training-time data augmentation such as random flipping to increase the number of valid examples. That is, instead of performing post-training estimation, we propose Dynamic Label Frequency Estimation (DLFE), which utilizes augmented training examples by maintaining a moving average of the per-batch biased probabilities during training. The significant increase in the number of valid examples, especially in SGDet, enables accurate label frequency estimation for unbiased probability recovery.

Our contribution in this work is three-fold. First, we are among the first to tackle the long tail problem in SGG from the perspective of reporting bias, which we remove by recovering the per-class unbiased probabilities from the biased ones with a PU-based approach. Second, to obtain accurate label frequency estimates for recovering unbiased probabilities in SGG, we propose DLFE, which takes advantage of training-time data augmentation and averages over multiple training iterations to introduce more valid examples. Third, we show that DLFE provides more reliable label frequency estimates than a naive variant of the traditional estimator, and we demonstrate that SGG models with DLFE effectively alleviate the long tail and achieve state-of-the-art debiasing performance with remarkably more informative scene graphs. We will release the source code to facilitate research towards an unbiased SGG methodology.

2. Related Work

2.1. Scene Graph Generation (SGG)

SGG (Lu et al., 2016) aims to generate pairwise visual relationships in the form of (subject-predicate-object) among salient objects, and there exist three training and evaluation settings (Xu et al., 2017; Zellers et al., 2018): (1) Predicate Classification (PredCls), predicting relationships given ground truth bounding boxes and object labels; (2) Scene Graph Classification (SGCls), predicting relationships and object labels given bounding boxes; and (3) Scene Graph Detection (SGDet), predicting relationships, object labels and bounding boxes from only input images.

Typically, SGG models consist of three main modules: proposal generation, object classification, and relationship prediction. Generally, a pre-trained object detection model (e.g., (Ren et al., 2015)) is adopted for generating proposals. For object classification, instead of using the predictions of object detection models directly, the generated proposals and their features are usually refined into object contexts (Xu et al., 2017; Yang et al., 2018; Zellers et al., 2018; Chen et al., 2019b; Tang et al., 2019) before being decoded into object labels. A common way to take object contexts into consideration is to run message passing algorithms (e.g., (Hochreiter and Schmidhuber, 1997; Tai et al., 2015; Li et al., 2015)) on a fully-connected (Xu et al., 2017; Yang et al., 2018; Chen et al., 2019b), chained (Zellers et al., 2018) or tree-structured (Tang et al., 2019) graph. For relationship prediction, most approaches (Zellers et al., 2018; Tang et al., 2020; Chen et al., 2019b) take in object contexts and bounding box features to compute relation contexts in a similar graphical manner. However, not until the recent works (Tang et al., 2019; Chen et al., 2019b) proposed the less biased mean recall metrics did the SGG research community pay attention to the class imbalance problem (Gu et al., 2019; Dornadula et al., 2019; Lin et al., 2020; Tang et al., 2020; Yan et al., 2020; Wang et al., 2020a). Tang et al. (Tang et al., 2020) borrow the counterfactual idea from causal inference to remove the context-specific bias. Yan et al. (Yan et al., 2020) propose to perform re-weighting with class relatedness-aware weights. Wang et al. (Wang et al., 2020a) transfer the less-biased knowledge from a secondary learner to the main one with knowledge distillation.

Our proposed method can be viewed as a model-agnostic debiasing method (Tang et al., 2020; Yan et al., 2020; Wang et al., 2020a). However, instead of focusing on class relatedness (Yan et al., 2020), context co-occurrence bias (Tang et al., 2020) or missing label bias (Wang et al., 2020a), we tackle the underlying reporting bias (Misra et al., 2016) by dealing with the unbalanced label frequency distribution.

Figure 2. An illustration of training and inferencing an SGG model in a PU manner with Dynamic Label Frequency Estimation (DLFE). Given an input image, proposals and their features are extracted by an object detector. Object classification is performed via message passing on a (e.g., chained (Zellers et al., 2018)) graph followed by object context decoding. Object contexts together with bounding boxes and features are then fed into another graph to be refined into relation contexts, followed by decoding into the biased probabilities. DLFE dynamically estimates the label frequencies as moving averages of the biased probabilities during training. Finally, the unbiased probabilities are recovered from the biased ones with the estimated label frequencies during inference.

2.2. Positive Unlabeled (PU) learning

While the traditional classification setting aims to learn classifiers with both positive and negative data, Learning from Positive and Unlabeled data, or Positive Unlabeled (PU) learning, is a variant of the traditional setting where a PU dataset contains only positive and unlabeled examples (Denis et al., 2005; Elkan and Noto, 2008; Bekker and Davis, 2020). That is, an unlabeled example can either be truly a negative, or belong to one or more classes. Learning a biased classifier assuming all unlabeled examples are negative (the default setting for most of the existing SGG works) could introduce missing label bias, producing unbalanced predictions. Common PU learning methods can be roughly divided into two categories (Bekker and Davis, 2020): (a) training an unbiased model, and (b) inferencing a biased model in a PU manner. We adopt the latter approach in this paper due to its convenience and favorable flexibility.

We note that while Chen et al. (Chen et al., 2019a) also deal with SGG in the PU setting, they do not dive deep into the long tail problem in scene graphs as we do in this paper. They propose a three-stage approach which generates pseudo-labels for the unlabeled examples with a biased trained model, followed by training a less biased model with the additional “positive” examples. However, their approach is time- and resource-consuming since it requires re-generating pseudo-labels whenever a different SGG model is used. Unlike (Chen et al., 2019a), our approach can not only be easily adapted to any SGG model with minimal modification, but is also superior in terms of debiasing performance.

3. Methodology

Scene graph generation aims to generate a graph comprising bounding boxes $B$, object labels $O$, and visual relationships $R$, given an input image $I$. The SGG task is usually decomposed into three components for joint training (Zellers et al., 2018):

(1)  $\Pr(G \mid I) = \Pr(B \mid I)\,\Pr(O \mid B, I)\,\Pr(R \mid B, O, I),$

where $\Pr(B \mid I)$ denotes proposal generation, $\Pr(O \mid B, I)$ object classification and $\Pr(R \mid B, O, I)$ relationship prediction. We propose to train an SGG model in the usual biased manner while dynamically estimating the label frequencies during training. The estimated label frequencies are then used to recover the unbiased probabilities during inference.

We describe our choice of proposal generation, object classification and relationship prediction in Section 3.1. We then explain how we recover the unbiased probabilities from the biased ones from a PU perspective in Section 3.2, followed by presenting our Dynamic Label Frequency Estimation (DLFE) in Section 3.3. Figure 2 shows an illustration of DLFE applied to SGG models like (Zellers et al., 2018; Tang et al., 2019).

3.1. Model Components

3.1.1. Proposal Generation

Given an image $I$, we adopt a pre-trained object detector (Ren et al., 2015) to extract object proposals together with their visual representations and union bounding box representations pooled from the output feature map. The visual representations also come with predicted object class probabilities.

3.1.2. Object Classification

For object classification, a graphical representation is constructed which takes in object features and class probabilities and outputs object contexts refined with message passing algorithms. We experiment with either chain-structured graphs (Zellers et al., 2018) with bi-directional LSTMs (Hochreiter and Schmidhuber, 1997), or tree-structured graphs (Tang et al., 2019) with TreeLSTMs (Tai et al., 2015). The output object contexts are then fed into a linear layer followed by a Softmax layer to decode the predicted object labels.
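To make the design concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' implementation) of the chain-structured variant: object features and class probabilities are concatenated, refined with a bi-directional LSTM, and decoded into object labels. All names and dimensions are our own assumptions.

```python
import torch
import torch.nn as nn

class ObjectContextEncoder(nn.Module):
    """Sketch of chain-structured object context refinement and label decoding."""

    def __init__(self, feat_dim, num_obj_classes, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim + num_obj_classes, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.decoder = nn.Linear(2 * hidden_dim, num_obj_classes)

    def forward(self, obj_feats, obj_probs):
        # obj_feats: (B, N, feat_dim) proposal features; obj_probs: (B, N, num_obj_classes)
        ctx, _ = self.bilstm(torch.cat([obj_feats, obj_probs], dim=-1))
        obj_dist = self.decoder(ctx).softmax(dim=-1)   # decoded object label distribution
        return ctx, obj_dist
```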

3.1.3. Relationship Prediction

Similar to object classification, another graphical representation of the same type is established to propagate contexts between features. The module takes in both the object labels and the object contexts and outputs refined relation contexts. For each object pair, the relation contexts, bounding boxes, union bounding boxes and features are gathered into a pairwise feature, which is decoded into a probability vector over the predicate classes with MLPs followed by a Softmax layer.
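Continuing the illustrative sketch above (again not the released implementation), the pairwise feature for an object pair can be formed by concatenating the two relation contexts with the pooled union-box feature before an MLP and Softmax produce the biased predicate probabilities; the class name and dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class RelationDecoder(nn.Module):
    """Sketch of pairwise relation decoding into biased predicate probabilities."""

    def __init__(self, ctx_dim, union_dim, num_predicates, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * ctx_dim + union_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_predicates))

    def forward(self, subj_ctx, obj_ctx, union_feat):
        # subj_ctx / obj_ctx: (P, ctx_dim) relation contexts of the paired objects;
        # union_feat: (P, union_dim) pooled union-box feature (box coordinates omitted for brevity).
        pair = torch.cat([subj_ctx, obj_ctx, union_feat], dim=-1)
        return self.mlp(pair).softmax(dim=-1)   # biased probabilities P(L | x)
```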

Figure 3. The per-class ratio of valid examples in all examples of the VG150 training set (Xu et al., 2017) (SGDet), obtained by inferencing a trained MOTIFS (Train-Est) or by dynamically inferencing a training MOTIFS with augmented data (DLFE). Numbers for DLFE are averaged over all epochs. All numbers can exceed 1 as a ground truth pair can match multiple proposal pairs.

3.2. Recovering the Unbiased Scene Graphs

Learning SGG from a dataset with missing labels can be viewed as a PU learning problem, which differs from traditional classification in that (a) no negative examples are available, and (b) unlabeled examples can either be truly negative or belong to any class. Learning classifiers from a PU dataset by treating all unlabeled data as negatives could introduce strong missing label bias (Elkan and Noto, 2008), i.e., predicted probabilities could be under-estimated, and reporting bias (Misra et al., 2016), i.e., the predicted probability of an inconspicuous class could be under-estimated more than that of a conspicuous one. We propose to avoid both biases by recovering the unbiased probabilities, marginalizing out the effect of uneven label frequencies.

Given $N$ predicate classes, we denote the visual relation examples taken in by the relationship prediction module of an SGG model by a set of tuples $(x, y, l)$, with $x$ an example (i.e., pairwise object features), $y$ the true predicate class (0 means the background class) and $l$ the relation label (0 means unannotated). The class $y$ cannot be observed from the dataset: while we can derive $y = l$ if the example is labeled ($l \neq 0$), $y$ can be any number ranging from $0$ to $N$ for an unlabeled example ($l = 0$).

For clarity, we now regard , and

as random variables. For a target class

, a biased SGG model is trained to predict the biased probability , which can be derived as follows:

(2)
(3)

where is the probability of example being selected to be labeled and is called propensity score (Bekker and Davis, 2020). Dividing each side by we obtain the unbiased probability :

(4)

However, as it is unrealistic to obtain the propensity scores of each , the existing works propose to (Elkan and Noto, 2008; Chen et al., 2019a) bypass the dependence on each by making the Selected Completely At Random (SCAR) assumption (Bekker and Davis, 2020): non-background examples are selected for labeling entirely at random regardless of , i.e., the set of labeled examples is uniformly drawn from the set of positive examples. This means that and Eqn 4 can be written as

(5)

where is the label frequency of class , or , which is the fraction of labeled examples in all the examples of class . Notably, discounting the effect of per-class label frequencies in this way also removes the reporting bias. Since label frequencies are usually not provided by annotators, an estimation is required.
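To make Eqn. 5 concrete, here is a small numerical sketch with made-up biased probabilities and label frequencies (not taken from the paper), showing how dividing by the per-class label frequency can promote an inconspicuous predicate over a conspicuous one, as in Fig. 1:

```python
# Illustrative only: made-up biased probabilities and label frequencies.
biased = {"on": 0.60, "parked on": 0.25}      # P(L=k | X=x) from a biased model
label_freq = {"on": 0.80, "parked on": 0.15}  # c_k = P(L=k | Y=k), estimated

# Eqn. 5: divide the biased probability by the per-class label frequency.
unbiased = {k: biased[k] / label_freq[k] for k in biased}
print(unbiased)  # {'on': 0.75, 'parked on': ~1.67} -> 'parked on' now ranks first
```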

3.3. Dynamic Label Frequency Estimation

One of the most common estimators of label frequencies, named Train-Est, is the per-class average of the biased probabilities predicted by a biased model on a training/validation set (Elkan and Noto, 2008) (see the full derivation in Appendix A):

(6)  $\tilde{c}_k = \frac{1}{|\mathcal{V}_k|} \sum_{x \in \mathcal{V}_k} P(L=k \mid X=x),$

where $\mathcal{V}_k$ denotes the set of labeled (valid) examples of class $k$ in a training or validation set and $|\mathcal{V}_k|$ is its cardinality. However, we find this way of estimation inconvenient and unsuitable for SGG. To understand why, recall that PredCls, SGCls and SGDet are the three SGG training and evaluation settings, and note that re-estimation of label frequencies is required for each setting since the expected biased probabilities could vary with the task difficulty (using label frequencies estimated in another mode is found to degrade the performance). First, the post-training estimation required before inferencing in each SGG setting is inconvenient and unfavorable. Second, the absence of ground truth bounding boxes in SGDet mode results in a lack of valid examples for label frequency estimation. For a proposal pair to be valid, both of its objects must simultaneously match ground truth boxes (with sufficient IoU) and object labels. Using Train-Est with MOTIFS (Zellers et al., 2018), as revealed in Fig. 3 (the blue bars), 9 out of 50 predicates do not have even a single valid example, i.e., $\mathcal{V}_k$ in Eqn. 6 is empty, making the estimate impossible to compute. In addition, more valid examples are missing for inconspicuous classes: as the examples of those classes are concentrated in a much smaller number of images, failing to match a bounding box can invalidate many examples. A naive remedy is using a default value for those missing estimates; however, as we show in Section 4.3, the performance is sub-optimal.
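The following is a minimal sketch of the Train-Est estimator in Eqn. 6, assuming the biased probability of each labeled (valid) example's annotated class has already been collected; the function and variable names are illustrative, not from the released code:

```python
from collections import defaultdict

def train_est(labeled_examples):
    """Train-Est (Eqn. 6, sketch): per-class average of the biased probabilities
    over labeled (valid) examples.

    labeled_examples: iterable of (predicate_class, biased_prob) pairs, where
    biased_prob is P(L=k | X=x) predicted by the biased model for the example's
    annotated class k.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for k, p in labeled_examples:
        sums[k] += p
        counts[k] += 1
    # Classes with no valid example simply get no entry here -- the failure
    # mode observed in SGDet, where a default value must be assigned instead.
    return {k: sums[k] / counts[k] for k in sums}
```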

Input : Training dataset $\mathcal{D}$ and momentum $\mu$
Output : Biased SGG model $f$ and estimated label frequencies $\tilde{c}_1, \dots, \tilde{c}_N$
for each mini-batch $\mathcal{B} \subset \mathcal{D}$ do
       Forward model $f$ on $\mathcal{B}$ to obtain the biased probabilities of the valid examples;
       // in-batch average of biased probabilities
       for each predicate class $k$ with a valid example in $\mathcal{B}$ do
             $\bar{c}_k \leftarrow$ mean of $P(L=k \mid X=x)$ over the valid examples $x$ of class $k$ in $\mathcal{B}$;
             // Update the exponential moving average
             $\tilde{c}_k \leftarrow \mu\,\tilde{c}_k + (1-\mu)\,\bar{c}_k$;
       end for
       Update $f$ with the SGG training loss on $\mathcal{B}$;
end for
return $f$ and $\tilde{\mathbf{c}} = (\tilde{c}_1, \dots, \tilde{c}_N)$;
Algorithm 1 DLFE during training time
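For concreteness, the following is a minimal PyTorch-style sketch of the moving-average update in Algorithm 1; it assumes the training loop already provides, for each mini-batch, the biased probabilities of the valid examples and their annotated predicate classes, and the class name, tensor shapes and default momentum are our own illustrative choices rather than the authors' implementation:

```python
import torch

class DLFEEstimator:
    """Sketch of the exponential-moving-average update in Algorithm 1."""

    def __init__(self, num_classes, momentum=0.9):  # momentum value is a placeholder
        self.momentum = momentum
        self.est = torch.zeros(num_classes)                     # running estimates of c_k
        self.seen = torch.zeros(num_classes, dtype=torch.bool)  # whether class k was ever updated

    def update(self, probs, labels):
        """probs: (M, C) biased probabilities of the M valid examples in a mini-batch;
        labels: (M,) their annotated predicate classes (1..C-1; 0 = background)."""
        for k in labels.unique():
            if k == 0:                                   # skip the background class
                continue
            batch_avg = probs[labels == k, k].mean()     # in-batch average for class k
            if self.seen[k]:
                self.est[k] = self.momentum * self.est[k] + (1 - self.momentum) * batch_avg
            else:                                        # first valid example of class k
                self.est[k] = batch_avg
                self.seen[k] = True
```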

To alleviate this problem, we propose to take advantage of training-time data augmentation to get back more valid examples for tail classes. Concretely, during training we augment the input data by random horizontal flipping, and meanwhile we estimate label frequencies with the per-batch biased probabilities. By doing this, the number of valid examples of tail classes becomes more normal (and higher) than that of Train-Est, since averaging over augmented examples and multiple training iterations (with varying object label predictions) essentially introduces more samples, which in turn increases the number of valid examples.

Based on this idea, we propose Dynamic Label Frequency Estimation (DLFE), whose main steps are shown in Algorithm 1. In detail, we maintain per-class moving averages of the biased probabilities (Eqn. 6) throughout training. The estimated label frequencies are dynamically updated with the per-batch averages using a momentum $\mu$ so that more recent estimates matter more: $\tilde{c}_k \leftarrow \mu\,\tilde{c}_k + (1-\mu)\,\bar{c}_k$, where $\bar{c}_k$ is the in-batch average for class $k$. Note that for each mini-batch we update the estimate $\tilde{c}_k$ of class $k$ only if at least one valid example of class $k$ is present in the current batch. The estimated values gradually stabilize as the model converges, and we save the final estimates (as a vector $\tilde{\mathbf{c}}$ of length $N$). During inference, the estimated label frequencies are utilized to recover the unbiased probability distribution from the biased one by

(7)  $\tilde{P}(Y \mid X=x) \propto \tilde{P}(L \mid X=x) \odot \tilde{\mathbf{c}}^{\,\odot -1},$

where $\odot$ denotes the Hadamard (element-wise) product and $\tilde{\mathbf{c}}^{\,\odot -1}$ the element-wise inverse of $\tilde{\mathbf{c}}$. The average per-epoch number of valid examples obtained this way is shown in Fig. 3 (the tangerine bars), where the inconspicuous classes get remarkably more valid examples. This not only enables accurate estimation for all the classes but also makes the estimation easier, as no additional post-training estimation is required.
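At inference time, the recovery in Eqn. 7 amounts to an element-wise division by the estimated label frequencies followed by re-normalization; below is a minimal sketch under these assumptions (the function name and the explicit re-normalization are ours). Note that for ranking predicates within a single object pair, the unnormalized scores already suffice, since re-normalization only rescales each row by a positive constant.

```python
import torch

def recover_unbiased(biased_probs, label_freq_est, eps=1e-12):
    """Eqn. 7 (sketch): element-wise division of the biased probabilities by the
    estimated label frequencies, followed by re-normalization.

    biased_probs: (M, C) Softmax outputs of the biased model.
    label_freq_est: (C,) estimated label frequencies (assumed > 0).
    """
    unbiased = biased_probs / (label_freq_est + eps)     # Hadamard scaling by 1 / c_k
    return unbiased / unbiased.sum(dim=-1, keepdim=True)
```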

4. Experiments

4.1. Evaluation Settings

We follow the recent efforts in SGG (Zellers et al., 2018; Chen et al., 2019b) to experiment on a subset of the VG dataset (Krishna et al., 2017) named VG150 (Xu et al., 2017), using its standard training, validation and testing splits. As discussed earlier, we train and evaluate in the three SGG settings: PredCls, SGCls and SGDet. We evaluate models with or without the graph constraint, i.e., whether only the single relation with the highest confidence is predicted for each object pair; the non-graph-constraint case is denoted as “ng”. For evaluation, we adopt recall-based metrics which measure the fraction of ground truth visual relations appearing in the top-K confident predictions, where K is 20, 50, or 100. However, as the plain recall could be dominated by a biased model predicting mostly head classes, we follow (Chen et al., 2019b; Tang et al., 2019; Yan et al., 2020; Wang et al., 2020a) to average over per-class recalls and focus on the less biased per-class/mean recall (mR@K) and non-graph-constraint per-class/mean recall (ng-mR@K) (while results in graph-constraint recalls (R) and mean recalls (mR) are less reflective of how unbiased an SGG model is, we provide them for reference in Appendix D). We note that the ng per-class/mean recall should be the fairest measure for debiasing methods since it 1) treats each class equally and 2) reflects the fact that more than one visual relation can exist for an object pair. We follow the long-tailed recognition research (Liu et al., 2019) to divide the distribution into three parts, namely head (many-shot; top-15 most frequent predicates), middle (medium-shot; mid-20) and tail (few-shot; last-15), and compute their ng-mRs. Note that by DLFE in this section, we mean the dynamic label frequency estimation along with our unbiased scene graph recovery approach.
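As a reference for how the headline metric is computed, below is a simplified sketch of mean recall at K, assuming the per-image matching of top-K predicted triplets to ground truth triplets has already been performed (matching rules, IoU thresholds and the ng variant are omitted):

```python
from collections import defaultdict

def mean_recall_at_k(gt_classes_per_image, matched_classes_per_image):
    """Simplified mR@K: average of per-class recalls over the whole test set.

    gt_classes_per_image[i]: predicate classes of the ground-truth triplets in image i.
    matched_classes_per_image[i]: classes of those triplets recovered in the top-K predictions.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for gts, matched in zip(gt_classes_per_image, matched_classes_per_image):
        for k in gts:
            totals[k] += 1
        for k in matched:
            hits[k] += 1
    recalls = [hits[k] / totals[k] for k in totals]
    return sum(recalls) / len(recalls)
```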

4.2. Implementation Details

As DLFE is a model-agnostic strategy, we experiment with two popular SGG backbones: MOTIFS (Zellers et al., 2018) and VCTree (Tang et al., 2019). Following (Tang et al., 2020; Wang et al., 2020a), we adopt a pre-trained and frozen Faster R-CNN (Ren et al., 2015) with ResNeXt-101-FPN (Lin et al., 2017a) as the object detector, which achieves 28.14 mAP on VG's testing set (Tang et al., 2020). All the hyperparameters, including the momentum $\mu$, are tuned on the validation set. All models are trained using the SGD optimizer with an initial learning-rate warm-up. Random flipping is applied to all the training examples. The learning rate is decayed by a factor of 10, at most twice, once the validation performance plateaus twice consecutively. Training stops early when the maximum decay step (two) is reached before the maximum of 50,000 iterations. The final checkpoint is used for evaluation. The batch size for all experiments is 48 (images). For the SGDet setting, we sample 80 proposals from each image and apply per-class NMS (Rosenfeld and Thurston, 1971). Besides ground truth visual relations, we follow (Tang et al., 2020) to sample up to a fixed number of relation pairs with a background-to-ground-truth ratio of 3:1.

4.3. Comparing DLFE to Train-Est

Figure 4. The label frequencies estimated by (a) Train-Est or (b) DLFE with MOTIFS (Zellers et al., 2018). Classes with a higher label frequency are more conspicuous than those with a lower one. Predicates are sorted by class frequency in descending order.
Figure 5. The absolute ng per-class R@100 change when recovering MOTIFS's (Zellers et al., 2018) unbiased probabilities with the label frequencies estimated by Train-Est or DLFE, in SGDet mode.
Predicate Classification (PredCls) Scene Graph Classification (SGCls) Scene Graph Detection (SGDet)
Model ng-mR@20 ng-mR@50 ng-mR@100 ng-mR@20 ng-mR@50 ng-mR@100 ng-mR@20 ng-mR@50 ng-mR@100
KERN (Chen et al., 2019b) - 36.3 49.0 - 19.8 26.2 - 11.7 16.0
GB-Net- (Zareian et al., 2020a) - 44.5 58.7 - 25.6 32.1 - 11.7 16.6
MOTIFS (Zellers et al., 2018; Wang et al., 2020a) 19.9 32.8 44.7 11.3 19.0 25.0 7.5 12.5 16.9
MOTIFS-Reweight 20.5 33.5 44.4 12.6 19.1 24.3 8.0 12.9 16.8
MOTIFS-L2+uKD (Wang et al., 2020a) - 36.9 50.9 - 22.7 30.1 - 14.0 19.5
MOTIFS-L2+cKD (Wang et al., 2020a) - 37.2 50.8 - 22.1 29.6 - 14.2 19.8
MOTIFS-TDE (Tang et al., 2020) 18.7 29.0 38.2 10.7 16.1 21.1 7.4 11.2 14.9
MOTIFS-PCPL (Yan et al., 2020) 25.6 38.5 49.3 13.1 19.9 25.6 9.8 14.8 19.6
MOTIFS-STL (Chen et al., 2019a) 15.7 29.4 43.2 10.3 18.4 27.2 6.4 10.6 15.0
MOTIFS-DLFE 30.0 45.8 57.7 17.6 25.6 32.0 11.7 18.1 23.0
VCTree (Tang et al., 2019; Wang et al., 2020a) 21.4 35.6 47.8 14.3 23.3 31.4 7.5 12.5 16.7
VCTree-Reweight 20.6 32.5 41.6 14.1 21.3 27.8 8.0 12.1 15.9
VCTree-L2+uKD (Wang et al., 2020a) - 37.7 51.7 - 26.8 35.2 - 13.8 19.1
VCTree-L2+cKD (Wang et al., 2020a) - 38.4 52.4 - 26.8 35.8 - 13.9 19.0
VCTree-TDE (Tang et al., 2020) 20.9 32.4 41.5 12.4 19.1 25.5 7.8 11.5 15.2
VCTree-PCPL (Yan et al., 2020) 25.1 38.5 49.3 17.2 25.9 32.7 9.9 15.1 19.9
VCTree-STL (Chen et al., 2019a) 16.8 31.8 45.1 12.7 22.0 32.7 6.0 10.0 14.1
VCTree-DLFE 29.1 44.6 56.8 21.6 31.4 38.8 11.7 17.5 22.5
Table 1. Performance comparison in ng-mR@K on VG150 (Krishna et al., 2017; Xu et al., 2017). Models in the first section use the VGG16 backbone (Simonyan and Zisserman, 2014); the remaining models use the ResNeXt-101-FPN backbone (Lin et al., 2017a), either implemented or reproduced by ourselves or with performance reported by the respective papers; one model additionally uses external knowledge bases.
Predicate Classification (PredCls) Scene Graph Classification (SGCls) Scene Graph Detection (SGDet)
Model mR@20 mR@50 mR@100 mR@20 mR@50 mR@100 mR@20 mR@50 mR@100
IMP+ (Xu et al., 2017; Chen et al., 2019b) - 9.8 10.5 - 5.8 6.0 - 3.8 4.8
FREQ (Zellers et al., 2018; Tang et al., 2019) 8.3 13.0 16.0 5.1 7.2 8.5 4.5 6.1 7.1
MOTIFS (Zellers et al., 2018; Tang et al., 2019) 10.8 14.0 15.3 6.3 7.7 8.2 4.2 5.7 6.6
KERN (Chen et al., 2019b) - 17.7 19.2 - 9.4 10.0 - 6.4 7.3
VCTree (Tang et al., 2019) 14.0 17.9 19.4 8.2 10.1 10.8 5.2 6.9 8.0
GPS-Net (Lin et al., 2020) 17.4 21.3 22.8 10.0 11.8 12.6 6.9 8.7 9.8
GB-Net- (Zareian et al., 2020a) - 22.1 24.0 - 12.7 13.4 - 7.1 8.5
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 13.0 16.5 17.8 7.2 8.9 9.4 5.3 7.3 8.6
MOTIFS-Focal (Lin et al., 2017b; Tang et al., 2020) 10.9 13.9 15.0 6.3 7.7 8.3 3.9 5.3 6.6
MOTIFS-Resample (Burnaev et al., 2015; Tang et al., 2020) 14.7 18.5 20.0 9.1 11.0 11.8 5.9 8.2 9.7
MOTIFS-Reweight 14.3 17.3 18.6 9.5 11.2 11.7 6.7 9.2 10.9
MOTIFS-L2+uKD (Wang et al., 2020a) 14.2 18.6 20.3 8.6 10.9 11.8 5.7 7.9 9.5
MOTIFS-L2+cKD (Wang et al., 2020a) 14.4 18.5 20.2 8.7 10.7 11.4 5.8 8.1 9.6
MOTIFS-TDE (Tang et al., 2020) 17.4 24.2 27.9 9.9 13.1 14.9 6.7 9.2 11.1
MOTIFS-PCPL (Yan et al., 2020) 19.3 24.3 26.1 9.9 12.0 12.7 8.0 10.7 12.6
MOTIFS-STL (Chen et al., 2019a) 13.3 20.1 22.3 8.5 12.8 14.1 5.4 7.6 9.1
MOTIFS-DLFE 22.1 26.9 28.8 12.8 15.2 15.9 8.6 11.7 13.8
VCTree (Tang et al., 2019, 2020) 14.1 17.7 19.1 9.1 11.3 12.0 5.2 7.1 8.3
VCTree-Reweight 16.3 19.4 20.4 10.6 12.5 13.1 6.6 8.7 10.1
VCTree-L2+uKD (Wang et al., 2020a) 14.2 18.2 19.9 9.9 12.4 13.4 5.7 7.7 9.2
VCTree-L2+cKD (Wang et al., 2020a) 14.4 18.4 20.0 9.7 12.4 13.1 5.7 7.7 9.1
VCTree-TDE (Tang et al., 2020) 19.2 26.2 29.6 11.2 15.2 17.5 6.8 9.5 11.4
VCTree-PCPL (Yan et al., 2020) 18.7 22.8 24.5 12.7 15.2 16.1 8.1 10.8 12.6
VCTree-STL (Chen et al., 2019a) 14.3 21.4 23.5 10.5 14.6 16.6 5.1 7.1 8.4
VCTree-DLFE 20.8 25.3 27.1 15.8 18.9 20.0 8.6 11.8 13.8
Table 2. Performance comparison of SGG models in graph-constraint mR@K on the VG150 (Krishna et al., 2017; Xu et al., 2017) testing set. Models in the first section use VGG16 while the others use ResNeXt-101-FPN. Markers have the same meanings as in Table 1.
Figure 6. Non-graph-constraint per-class Recall@20 (PredCls) change w.r.t. the MOTIFS baseline. DLFE significantly improves the mid-to-tail recalls (where the other debiasing methods struggle) without compromising much of the head classes' performance.

We aim to answer the question of whether DLFE is more effective in estimating label frequencies than Train-Est by comparing 1) the consistency of the estimated label frequencies and 2) their debiasing performance. As discussed earlier, the label frequencies of predicates lacking a valid example cannot be estimated by Train-Est; we thus naively assign the median of the other estimated label frequencies to those missing values.

A comparison of the estimated label frequencies is presented in Fig. 4. It is clear from (a) that, even for the classes with at least one valid example, the Train-Est estimates tend to be abnormally high in the SGDet setting. Note that while there might be differences in estimated values across SGG settings, with the same backbone they should still be relatively similar. In contrast, (b) shows that the DLFE-estimated values are more consistent across the three settings. We also compare their debiasing performance in SGDet (results in numbers and in the other two SGG settings are provided in Appendix B) with the absolute ng per-class R@100 change in Fig. 5. Apparently, Train-Est barely improves the per-class recalls, especially for tail classes that lack enough valid examples, while DLFE achieves higher and more consistent improvements across all the predicates.

These results verify the claim that, apart from being more convenient (requiring no post-training estimation), DLFE is more effective than the naive Train-Est in providing reliable estimates.

Head Recalls Middle Recalls Tail Recalls
Model R@50 R@100 R@50 R@100 R@50 R@100
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 65.9 78.6 30.0 45.4 3.3 9.7
MOTIFS-Reweight 57.4 69.2 30.7 43.0 13.3 21.5
MOTIFS-TDE (Tang et al., 2020) 48.3 60.8 34.9 46.1 1.8 5.3
MOTIFS-PCPL (Yan et al., 2020) 66.5 77.6 41.8 55.2 6.0 13.2
MOTIFS-STL (Chen et al., 2019a) 56.4 70.0 24.1 39.8 9.6 21.2
MOTIFS-DLFE 61.9 72.4 42.8 54.2 31.8 44.6
VCTree (Tang et al., 2019, 2020) 67.5 79.8 34.3 50.0 5.5 12.7
VCTree-Reweight 61.6 73.4 28.3 38.3 9.0 14.3
VCTree-TDE (Tang et al., 2020) 54.8 67.5 37.9 49.1 2.5 5.4
VCTree-PCPL (Yan et al., 2020) 64.5 75.9 42.6 54.2 6.9 16.1
VCTree-STL (Chen et al., 2019a) 57.6 71.1 26.1 41.8 13.8 23.5
VCTree-DLFE 57.5 68.3 36.0 48.2 26.5 38.1
Table 3. Non-graph-constraint head, middle and tail recalls (PredCls). Markers have the same meanings as in Table 1. DLFE improves the tail recalls by a large margin.
Figure 7. Scene graphs generated by MOTIFS (left) and MOTIFS-DLFE (right) in PredCls. Only the top-1 prediction is shown for each object pair. A prediction can be correct (matches GT), incorrect (does not match GT and weird) or acceptable (does not match GT but still reasonable).

4.4. Comparing to other Debiasing Methods

While we list the results of different SGG backbone models for reference, we mainly compare our approach with the model-agnostic debiasing methods, including Focal Loss (Lin et al., 2017b), Resampling (Burnaev et al., 2015), Reweighting, L2+{u,c}KD (Wang et al., 2020a), TDE (Tang et al., 2020), PCPL (Yan et al., 2020) and STL (Chen et al., 2019a). L2+{u,c}KD is a two-learner knowledge distillation framework for reducing dataset biases. TDE is an inference-time debiasing method which applies counterfactual thinking to remove the context-specific bias. PCPL learns relatedness scores among predicates which are used as the weights in reweighting, and is the current state of the art in terms of mR. STL generates soft pseudo-labels for unlabeled data which are used for joint training. We re-implement PCPL and STL since their backbones are not directly comparable, and Reweighting since no reported performance exists for VCTree. We report our reproduced results of TDE with the authors' codebase (Tang, 2020).

Figure 8. Probability distributions (normalized to sum 1) over the classes by MOTIFS (top of each example) and MOTIFS-DLFE (bottom). The top-1 predictions can be correct (GT), incorrect (Non-GT and weird) or acceptable (Non-GT but reasonable).

The results in ng-mR@K are presented in Table 1, where DLFE significantly improves ng-mRs for both MOTIFS and VCTree and outperforms the existing debiasing methods. Notably, TDE, which was proposed to alleviate the long tail by removing the context-specific bias, is shown to adversely affect the ability to predict multiple relations per object pair. This shows that removing the reporting bias is more beneficial for debiasing SGG models.

While the graph-constraint mR metric does not reflect the fact that multiple relations can exist between an object pair, we still present the results in Table 2 due to its popularity. Debiasing MOTIFS with our proposed DLFE still significantly improves its mR, achieving state-of-the-art mR across all three settings. Large performance boosts are also seen for VCTree with DLFE, and new SOTAs are attained for PredCls (mR@20), SGCls and SGDet.

To better understand how DLFE affects the performance of each class, we also present the non-graph-constraint per-class Recall@20 changes compared to the backbone-only MOTIFS (biased classifier) in Figure 6. While all the debiasing methods increase the recall of the less frequent, middle-to-tail classes, only DLFE improves the tail (last-15) classes' performance significantly. The other approach that also visibly improves the tail classes' performance is Reweighting; however, its relatively small improvement demonstrates that naively dealing with the unbalanced class frequencies is less effective than tackling the reporting bias.

We also present the head (many-shot), middle (medium-shot) and tail (few-shot) non-graph-constraint recalls in PredCls with the MOTIFS backbone in Table 3 (the full results with graph constraint, and for SGCls/SGDet, are available in Appendix C). Remarkably, DLFE outperforms the others by a significant margin on the tail recalls, e.g., Tail R@50 is 31.8 for DLFE, versus 1.8 for TDE, 6.0 for PCPL and 13.3 for Reweighting, showing that DLFE is especially capable of dealing with the long tail.

4.5. Qualitative Results

The scene graphs of three testing images are visualized in Figure 7, where the scene graphs on the left are generated by MOTIFS and those on the right by MOTIFS-DLFE. (a) is an apparent example where, while wheel-on-car, car-on-street and hair-on-man predicted by MOTIFS are reasonable, wheel-mounted on-car, car-parked on-street and hair-belonging to-man predicted by MOTIFS-DLFE match the ground truth and are also more descriptive (while being inconspicuous). Similarly, tree-growing on-hill in example (b) and woman-standing on-beach in (c) are also correct and more descriptive; however, due to the missing label issue in the VG dataset, tree-growing on-hill cannot be correctly recalled (shown in tangerine color). In addition, there are some seemingly incorrect annotations such as tree-standing on-tree in example (a), where the subject actually indicates a smaller, branch part of a tree. For this object pair, MOTIFS-DLFE predicts growing on, which, ironically, seems more reasonable than the ground truth label.

To understand how DLFE changes the probability distribution, we visualize the biased (MOTIFS) and unbiased (MOTIFS-DLFE) probabilities, given a subject-object pair, in Figure 8 (more visualizations are available in Appendix E). Prediction confidences are shown to be calibrated towards minor but expressive predicates, such as (a) car-parked on-street, (b) wheel-of-train and (d) people-sitting on-bench (while sitting on is not in the ground truth). Notably, in (c) the fork is actually not on the plate but was mis-predicted by MOTIFS due to the strong bias (i.e., many fork-in/on-plate examples in the VG dataset), while MOTIFS-DLFE correctly predicts near. Moreover, (b) shows that the confidences of MOTIFS-DLFE for predicates other than the ground truth of, such as mounted on and part of, have increased remarkably, presumably because they are also reasonable choices. This demonstrates the effectiveness of DLFE for balanced SGG.

5. Conclusions

In this paper, we deal with the long tail problem in SGG by addressing its cause (unbalanced missing labels) instead of its superficial effect (the long-tailed class distribution). To ward off the reporting bias caused by the imbalance in missing labels, we view SGG as a PU learning problem and remove the per-class missing label bias by recovering the unbiased probabilities from the biased ones. To obtain reliable label frequencies for unbiased probability recovery, we take advantage of the data augmentation during training and perform Dynamic Label Frequency Estimation (DLFE), which maintains moving averages of the per-class biased probabilities and effectively introduces more valid samples, especially in the SGDet training and evaluation mode. Extensive quantitative and qualitative experiments demonstrate that DLFE is more effective in estimating label frequencies than a naive variant of the traditional estimator, and that SGG models with DLFE achieve state-of-the-art debiasing performance on the VG dataset, producing well-balanced scene graphs.

References

  • I. Armeni, Z. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese (2019) 3d scene graph: a structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5664–5673. Cited by: §1.
  • J. Bekker and J. Davis (2020) Learning from positive and unlabeled data: a survey. Machine Learning 109 (4), pp. 719–760. Cited by: §1, §2.2, §3.2.
  • E. Burnaev, P. Erofeev, and A. Papanov (2015) Influence of resampling on accuracy of imbalanced classification. In Eighth international conference on machine vision (ICMV 2015), Vol. 9875, pp. 987521. Cited by: §4.4, Table 10, Table 2, Table 8, Table 9.
  • D. Chen, X. Liang, Y. Wang, and W. Gao (2019a) Soft transfer learning via gradient diagnosis for visual relationship detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1118–1126. Cited by: §1, §2.2, §3.2, Table 5, Table 6, Table 7, §C, §4.4, Table 1, Table 10, Table 2, Table 3, Table 8, Table 9.
  • T. Chen, W. Yu, R. Chen, and L. Lin (2019b) Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6163–6171. Cited by: §1, §2.1, §4.1, Table 1, Table 10, Table 2, Table 8, Table 9.
  • V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei (2019c) Scene graph prediction with limited labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2580–2590. Cited by: §1, §1.
  • M. Chiou, C. Liao, L. Wang, R. Zimmermann, and J. Feng (2021a) ST-hoi: a spatial-temporal baseline for human-object interaction detection in videos. arXiv preprint arXiv:2105.11731. Cited by: §1.
  • M. Chiou, Z. Liu, Y. Yin, A. Liu, and R. Zimmermann (2020) Zero-shot multi-view indoor localization via graph location networks. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 3431–3440. Cited by: §1.
  • M. Chiou, R. Zimmermann, and J. Feng (2021b) Visual relationship detection with visual-linguistic knowledge from multimodal representations. IEEE Access 9, pp. 50441–50451. Cited by: §1.
  • F. Denis, R. Gilleron, and F. Letouzey (2005) Learning from positive and unlabeled examples. Theoretical Computer Science 348 (1), pp. 70–83. Cited by: §1, §2.2.
  • A. Dornadula, A. Narcomey, R. Krishna, M. Bernstein, and F. Li (2019) Visual relationships as functions: enabling few-shot scene graph prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1, §2.1.
  • C. Elkan and K. Noto (2008) Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. Cited by: §1, §1, §A, §2.2, Table 4, §3.2, §3.2, §3.3.
  • J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling (2019) Scene graph generation with external knowledge and image reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1969–1978. Cited by: §1, §2.1.
  • T. He, L. Gao, J. Song, J. Cai, and Y. Li (2020) Learning from the scene and borrowing from the rich: tackling the long tail in scene graph generation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 587–593. Note: Main track Cited by: §1, §1.
  • R. Herzig, M. Raboh, G. Chechik, J. Berant, and A. Globerson (2018) Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems, pp. 7211–7221. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1, §3.1.2.
  • Z. Hung, A. Mallya, and S. Lazebnik (2020) Contextual translation embedding for visual relationship detection and scene graph generation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, Table 10, Table 8, Table 9.
  • M. Khademi and O. Schulte (2020) Deep generative probabilistic graph neural networks for scene graph generation. Proceedings of the AAAI Conference on Artificial Intelligence 34 (07), pp. 11237–11245. External Links: Link, Document Cited by: §1, Table 10, Table 8, Table 9.
  • B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville, and E. Belilovsky (2020) Graph density-aware losses for novel compositions in scene graph generation. In British Machine Vision Conference (BMVC), Cited by: §1, footnote 7.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, §4.1, Table 1, Table 2.
  • X. Li and S. Jiang (2019) Know more say less: image captioning based on scene graphs. IEEE Transactions on Multimedia 21 (8), pp. 2117–2130. Cited by: §1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.1.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §4.2, Table 1.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §4.4, Table 10, Table 2, Table 8, Table 9.
  • X. Lin, C. Ding, J. Zeng, and D. Tao (2020) GPS-net: graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, Table 10, Table 2, Table 8, Table 9.
  • Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537–2546. Cited by: §4.1.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869. Cited by: §1, §2.1.
  • I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick (2016) Seeing through the human reporting bias: visual classifiers from noisy human-centric labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2939. Cited by: §1, §2.1, §3.2.
  • G. Ren, L. Ren, Y. Liao, S. Liu, B. Li, J. Han, and S. Yan (2021) Scene graph generation with hierarchical context. IEEE Transactions on Neural Networks and Learning Systems 32 (2), pp. 909–915. External Links: Document Cited by: §1, Table 10, Table 8, Table 9.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.1, §3.1.1, §4.2.
  • A. Rosenfeld and M. Thurston (1971) Edge and curve detection for visual scene analysis. IEEE Transactions on computers 100 (5), pp. 562–569. Cited by: §4.2.
  • S. Sharifzadeh, S. M. Baharlou, and V. Tresp (2020) Classification by attention: scene graph classification with prior knowledge. arXiv preprint arXiv:2011.10084. Cited by: §1.
  • J. Shi, H. Zhang, and J. Li (2019) Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376–8384. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 1, Table 10, Table 8, Table 9.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, pp. 1556–1566. Cited by: §2.1, §3.1.2.
  • K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang (2020) Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3716–3725. Cited by: §1, §1, §1, §2.1, §2.1, Table 4, Table 5, Table 6, Table 7, §C, §4.2, §4.4, Table 1, Table 10, Table 2, Table 3, Table 8, Table 9, footnote 7.
  • K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu (2019) Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6619–6628. Cited by: §2.1, Table 4, §B, §3.1.2, Table 5, Table 6, Table 7, §3, §4.1, §4.2, Table 1, Table 10, Table 2, Table 3, Table 8, Table 9.
  • K. Tang (2020) A scene graph generation codebase in pytorch. Note: https://github.com/KaihuaTang/Scene-Graph-Benchmark.pytorch Cited by: §4.4.
  • D. Teney, L. Liu, and A. van Den Hengel (2017) Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §1.
  • H. Tian, N. Xu, A. Liu, and Y. Zhang (2020) Part-aware interactive learning for scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 3155–3163. Cited by: §1, Table 10, Table 8, Table 9.
  • T. J. Wang, S. Pehlivan, and J. Laaksonen (2020a) Tackling the unannotated: scene graph generation with bias-reduced models. In 31st British Machine Vision Conference 2020, BMVC 2020, Virtual Event, UK, September 7-10, 2020, Cited by: §1, §1, §2.1, §2.1, §4.1, §4.2, §4.4, Table 1, Table 10, Table 2, Table 8, Table 9.
  • W. Wang, R. Wang, S. Shan, and X. Chen (2020b) Sketching image gist: human-mimetic hierarchical scene graph generation. arXiv preprint arXiv:2007.08760. Cited by: §1.
  • M. Wei, C. Yuan, X. Yue, and K. Zhong (2020) HOSE-net: higher order structure embedded network for scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, New York, NY, USA, pp. 1846–1854. External Links: ISBN 9781450379885 Cited by: §1, §1, Table 10, Table 8, Table 9.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419. Cited by: §2.1, §2.1, Figure 3, §4.1, Table 1, Table 10, Table 2, Table 8, Table 9.
  • S. Yan, C. Shen, Z. Jin, J. Huang, R. Jiang, Y. Chen, and X. Hua (2020) PCPL: predicate-correlation perception learning for unbiased scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 265–273. Cited by: §1, §1, §2.1, §2.1, Table 5, Table 6, Table 7, §C, §4.1, §4.4, Table 1, Table 10, Table 2, Table 3, Table 8, Table 9.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision, pp. 670–685. Cited by: §1, §2.1.
  • X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §1.
  • T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision, pp. 684–699. Cited by: §1.
  • F. Yu, H. Wang, T. Ren, J. Tang, and G. Wu (2020) Visual relation of interest detection. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, New York, NY, USA, pp. 1386–1394. External Links: ISBN 9781450379885 Cited by: §1.
  • C. Yuren, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn (2020) NODIS: neural ordinary differential scene understanding. CoRR abs/2001.04735. Cited by: §1, Table 10, Table 8, Table 9.
  • A. Zareian, S. Karaman, and S. Chang (2020a) Bridging knowledge graphs to generate scene graphs. arXiv preprint arXiv:2001.02314. Cited by: §1, Table 1, Table 10, Table 2, Table 8, Table 9.
  • A. Zareian, H. You, Z. Wang, and S. Chang (2020b) Learning visual commonsense for robust scene graph generation. arXiv preprint arXiv:2006.09623. Cited by: §1.
  • R. Zellers, M. Yatskar, S. Thomson, and Y. Choi (2018) Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840. Cited by: §1, §1, §1, Figure 2, §2.1, §2.1, Table 4, §B, §3.1.2, §3.3, Table 5, Table 6, Table 7, §3, §3, Figure 4, Figure 5, §4.1, §4.2, Table 1, Table 10, Table 2, Table 3, Table 8, Table 9.

Appendices

A. Deriving the Train-Est Estimator

We show in this section why biased probabilities can be used as a label frequency estimator. Denote an annotated example in a training or validation set by $(x, y, l)$, where $x$ is the pairwise example, $l$ is the relation label and $y$ is the true class. Note that $y = l \neq 0$ as we are only considering the annotated examples. Referring to (Elkan and Noto, 2008), the biased probability of class $k$ for such an example can be derived as follows:

$P(L=k \mid X=x) = P(Y=k \mid X=x)\,P(L=k \mid Y=k, X=x) = 1 \cdot c_k = c_k,$

where $c_k$ is the label frequency of class $k$. Thus, we can obtain a reasonable label frequency estimate by averaging the per-class biased probabilities over a training or validation set.

B. More results of Train-Est and DLFE

We present additional results in non-graph-constraint mean recall (ng-mR@K) in Table 4. With both the MOTIFS (Zellers et al., 2018) and VCTree (Tang et al., 2019) backbones, our DLFE achieves significantly higher ng-mRs in both the PredCls and SGDet settings and is on par with Train-Est in the SGCls setting.

Predicate Classification (PredCls) Scene Graph Classification (SGCls) Scene Graph Detection (SGDet)
Model ng-mR@20 ng-mR@50 ng-mR@100 ng-mR@20 ng-mR@50 ng-mR@100 ng-mR@20 ng-mR@50 ng-mR@100
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 19.9 32.8 44.7 11.3 19.0 25.0 7.5 12.5 16.9
MOTIFS-Train-Est (Elkan and Noto, 2008) 24.4 38.9 50.5 17.1 26.1 32.8 8.9 14.1 18.9
MOTIFS-DLFE 30.0 45.8 57.7 17.6 25.6 32.0 11.7 18.1 23.0
VCTree (Tang et al., 2019, 2020) 21.4 35.6 47.8 12.4 19.1 25.5 7.5 12.5 16.7
VCTree-Train-Est (Elkan and Noto, 2008) 25.0 39.1 52.4 21.0 32.2 39.4 8.1 13.0 17.1
VCTree-DLFE 29.1 44.6 56.8 21.6 31.4 38.8 11.7 17.5 22.5
Table 4. Comparison of non-graph-constraint mean recalls (ng-mR@K) between Train-Est and our DLFE, in PredCls, SGCls and SGDet.

C. Results in Head/Middle/Tail Recalls

We compare our proposed DLFE with other debiasing methods, i.e., Reweighting, TDE (Tang et al., 2020), PCPL (Yan et al., 2020) and STL (Chen et al., 2019a), on the recalls of different parts of the predicate distribution: (i) head (many-shot), (ii) middle (medium-shot) and (iii) tail (few-shot) recalls. Bar plots of the head, middle and tail recalls (non-graph constraint) of the MOTIFS and VCTree backbones are presented in Figure 9. In all three SGG tasks, both MOTIFS-DLFE and VCTree-DLFE remarkably outperform the other debiasing methods by a large margin on the tail recall, while being on par on the head and middle recalls.

The head, middle and tail recalls with/without graph constraint for the PredCls, SGCls and SGDet tasks are presented in Table 5, Table 6 and Table 7, respectively. Again, DLFE improves middle and tail recalls more significantly, at a cost to head recall similar to that of the other debiasing approaches.
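For reference, the following is a minimal sketch of how such group-wise numbers can be computed from per-class recalls; the grouping thresholds and helper names are illustrative assumptions and may not match the exact many-/medium-/few-shot split used in the paper.

```python
import numpy as np

def group_mean_recalls(per_class_recall, train_counts,
                       head_min=10000, tail_max=500):
    """Average per-class recalls within head/middle/tail groups.

    per_class_recall: dict {predicate_class: recall@K on the test set}
    train_counts:     dict {predicate_class: number of training instances}
    The thresholds head_min/tail_max are only illustrative cut-offs.
    """
    groups = {"head": [], "middle": [], "tail": []}
    for c, r in per_class_recall.items():
        n = train_counts[c]
        if n >= head_min:
            groups["head"].append(r)
        elif n <= tail_max:
            groups["tail"].append(r)
        else:
            groups["middle"].append(r)
    # Empty groups fall back to 0.0 to avoid averaging over nothing.
    return {g: float(np.mean(rs)) if rs else 0.0 for g, rs in groups.items()}
```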

Figure 9. Bar plots of head (many-shot), middle (medium-shot) and tail (few-shot) recalls, based on the MOTIFS (left) and VCTree (right) backbones, evaluated on VG150. From top to bottom are the results in PredCls, SGCls and SGDet, respectively.
Predicate Classification (PredCls)
Head Recalls Middle Recalls Tail Recalls
Model R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 35.5/45.2 42.9/65.9 45.4/78.6 6.0/15.2 9.0/30.0 10.3/45.4 0.0/0.9 0.0/3.3 0.0/9.7
MOTIFS-Reweight 32.4/40.0 38.5/57.4 40.6/69.2 10.1/17.3 12.6/30.7 13.9/43.0 1.9/5.3 2.4/13.3 2.9/21.5
MOTIFS-TDE (Tang et al., 2020) 29.8/32.0 40.8/48.3 46.3/60.8 21.2/22.7 29.8/34.9 34.9/46.1 0.0/0.2 0.0/1.8 0.1/5.3
MOTIFS-PCPL (Yan et al., 2020) 39.4/47.7 47.0/66.5 49.5/77.6 19.5/27.0 25.3/41.8 27.9/55.2 0.2/1.8 0.2/6.0 0.2/13.2
MOTIFS-STL (Chen et al., 2019a) 33.6/37.1 43.4/56.4 46.3/70.0 7.7/10.3 13.5/24.1 16.5/39.8 0.5/1.6 5.5/9.6 6.0/21.2
MOTIFS-DLFE 34.5/44.4 40.7/61.9 42.7/72.4 19.9/27.2 25.3/42.8 27.9/54.2 11.2/18.6 15.1/31.8 15.7/44.6
VCTree (Tang et al., 2019, 2020) 36.6/47.0 43.8/67.5 46.2/79.8 7.8/17.5 11.4/34.3 13.0/50.0 0.0/1.0 0.0/5.5 0.1/12.7
VCTree-Reweight 36.6/43.6 43.1/61.6 45.4/73.4 12.2/16.6 14.6/28.3 15.2/38.3 1.3/2.9 2.3/9.0 2.3/14.3
VCTree-TDE (Tang et al., 2020) 34.0/36.9 45.1/54.8 50.0/67.5 22.5/24.2 31.6/37.9 36.2/49.1 0.1/0.3 0.3/2.5 0.5/5.4
VCTree-PCPL (Yan et al., 2020) 38.0/46.3 45.4/64.5 47.8/75.9 18.1/26.3 23.0/42.6 25.4/54.2 0.1/2.4 0.1/6.9 0.1/16.1
VCTree-STL (Chen et al., 2019a) 34.0/37.8 43.8/57.6 46.5/71.1 8.4/10.9 14.3/26.1 17.1/41.8 2.4/3.6 8.4/13.8 9.1/23.5
VCTree-DLFE 30.1/41.1 37.3/57.5 39.6/68.3 14.4/21.9 19.0/36.0 20.8/48.2 6.0/12.0 8.1/26.5 9.2/38.1
Table 5. Head, middle and tail (with/without graph constraint) recalls in the PredCls task on VG150. Markers have the same meaning as in Table 1 of the main paper.
Scene Graph Classification (SGCls)
Head Recalls Middle Recalls Tail Recalls
Model R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 21.3/27.4 25.1/39.1 26.3/45.6 2.1/7.3 3.5/16.6 3.9/24.3 0.0/0.5 0.0/2.1 0.0/5.4
MOTIFS-Reweight 21.6/25.7 25.1/35.7 26.3/42.1 6.8/10.2 8.0/16.5 8.3/21.9 0.9/2.6 1.6/6.0 1.7/9.7
MOTIFS-TDE (Tang et al., 2020) 18.0/19.6 23.5/28.4 26.1/35.2 11.2/12.0 15.2/18.5 17.8/24.3 0.0/0.2 0.0/0.6 0.0/2.7
MOTIFS-PCPL (Yan et al., 2020) 23.0/28.2 27.1/38.7 28.3/44.9 7.3/10.9 9.5/18.5 10.3/25.2 0.1/0.9 0.1/3.1 0.2/6.6
MOTIFS-STL (Chen et al., 2019a) 21.3/23.8 26.4/34.8 27.8/42.1 5.2/7.0 8.8/16.2 10.2/25.2 0.2/1.2 4.4/5.2 5.7/15.1
MOTIFS-DLFE 21.2/27.2 24.6/36.6 25.6/42.0 11.3/15.4 14.2/23.5 15.1/29.5 7.1/11.1 8.3/17.7 8.4/24.8
VCTree (Tang et al., 2019, 2020) 25.3/32.8 29.9/46.8 31.3/54.7 3.8/10.5 5.8/21.2 6.4/31.5 0.0/1.0 0.0/2.6 0.0/8.0
VCTree-Reweight 24.2/29.5 28.4/41.2 29.7/49.3 7.1/10.9 8.5/18.0 9.0/25.2 1.7/2.9 1.9/5.9 1.9/9.7
VCTree-TDE (Tang et al., 2020) 19.1/21.6 26.2/33.6 30.0/42.7 13.7/14.5 18.2/21.2 21.2/28.2 0.0/0.5 0.1/1.9 0.1/4.6
VCTree-PCPL (Yan et al., 2020) 27.2/33.6 32.0/46.2 33.5/53.4 11.3/16.8 13.9/26.3 15.2/34.7 0.1/1.3 0.1/5.1 0.1/9.4
VCTree-STL (Chen et al., 2019a) 24.8/27.6 30.7/40.8 32.3/49.7 6.6/9.0 10.3/19.1 12.2/30.5 1.6/2.8 4.1/6.9 6.6/18.7
VCTree-DLFE 22.8/30.2 26.9/41.2 28.3/48.0 15.4/20.4 18.5/29.9 19.8/37.6 9.3/14.7 11.3/23.8 12.0/31.3
Table 6. Head, middle and tail (with/without graph constraint) recalls in the SGCls task on VG150. Markers have the same meaning as in Table 1 of the main paper.
Scene Graph Detection (SGDet)
Head Recalls Middle Recalls Tail Recalls
Model R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100 R/ngR@20 R/ngR@50 R/ngR@100
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 15.4/18.8 20.7/28.7 24.1/36.2 1.6/4.7 2.8/9.2 3.3/14.2 0.0/0.1 0.0/0.7 0.0/1.3
MOTIFS-Reweight 15.3/17.2 20.4/25.2 23.9/32.2 4.8/6.5 6.4/10.9 7.9/14.6 0.5/0.9 1.8/3.1 1.9/4.5
MOTIFS-TDE (Tang et al., 2020) 12.8/13.7 17.4/20.4 20.9/26.1 7.1/8.2 9.9/12.7 12.1/16.6 0.0/0.0 0.0/0.1 0.0/1.4
MOTIFS-PCPL (Yan et al., 2020) 17.1/19.5 22.5/28.5 26.0/35.7 7.1/9.3 9.8/14.5 12.0/19.8 0.1/0.7 0.1/1.6 0.1/3.3
MOTIFS-STL (Chen et al., 2019a) 14.1/15.4 19.2/23.8 22.9/31.2 3.0/4.3 4.3/7.9 5.4/12.2 0.0/0.3 0.2/1.0 0.4/2.6
MOTIFS-DLFE 15.1/19.6 20.0/27.9 23.4/34.1 8.0/11.0 10.9/16.9 13.3/21.9 3.5/5.3 5.3/11.1 6.7/15.1
VCTree (Tang et al., 2019, 2020) 15.1/18.5 20.1/28.2 23.3/35.6 1.8/4.8 2.6/9.6 3.3/14.0 0.0/0.1 0.0/0.6 0.0/1.3
VCTree-Reweight 14.7/16.8 19.6/24.9 22.7/31.3 4.6/6.0 6.1/9.8 7.2/13.7 1.1/1.8 1.2/2.3 1.3/3.4
VCTree-TDE (Tang et al., 2020) 12.9/14.5 17.8/21.7 21.4/27.3 7.3/8.6 10.3/12.4 12.5/16.5 0.0/0.1 0.0/0.2 0.0/1.4
VCTree-PCPL (Yan et al., 2020) 16.8/19.1 22.0/27.9 25.4/34.7 7.6/10.0 10.3/15.7 12.4/20.7 0.1/0.5 0.1/1.5 0.1/4.0
VCTree-STL (Chen et al., 2019a) 13.6/15.0 18.5/22.8 21.8/29.7 2.4/3.7 3.8/7.4 4.7/11.5 0.1/0.2 0.1/0.7 0.1/1.8
VCTree-DLFE 13.6/18.0 18.2/25.9 21.2/31.9 8.3/11.5 11.3/16.8 13.1/21.3 3.9/5.6 6.0/10.2 7.2/14.7
Table 7. Head, middle and tail (with/without graph constraint) recalls in the SGDet task on VG150. Markers have the same meaning as in Table 1 of the main paper.

D. More Results in Recalls

We present a comprehensive comparison of recently published (2020–now) SGG models/debiasing methods in graph-constraint recalls and mean recalls, in PredCls mode in Table 8, in SGCls mode in Table 9 and in SGDet mode in Table 10. (We do not compare with (Knyazev et al., 2020) because 1) their reported numbers are rather selective and incomplete and 2) their method was not compared with other debiasing methods (e.g., (Tang et al., 2020)) fairly, i.e., with the same backbone.) We note that the plain recall (R@) is biased, as it favors the head classes and does not reflect that multiple relations could exist for an object pair. While the debiasing methods show some performance drops in the more conventional recalls (R@), this is because predicates are being classified into more descriptive ones, which often have not been annotated as ground truth. Mean recall (mR@) is less biased than the plain recall as it treats all classes equally; however, it still does not consider the multi-relation issue. Remarkably, it is clear from all three tables that our DLFE still achieves state-of-the-art mR@ compared with the other debiasing methods using the same backbone, and either MOTIFS-DLFE or VCTree-DLFE attains the highest mR scores across all models and backbones.
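To make the contrast between R@ and mR@ concrete, the following is a minimal sketch (with hypothetical names and toy numbers of our own) that computes both metrics from per-class hit counts and shows how a single head class can dominate the plain recall.

```python
import numpy as np

def recall_and_mean_recall(hits_per_class, gts_per_class):
    """Contrast plain recall (R@K) with mean recall (mR@K).

    hits_per_class / gts_per_class: length-C arrays giving, for each
    predicate class, the number of recalled and of ground-truth relations
    at a fixed K, aggregated over the test set.
    """
    hits = np.asarray(hits_per_class, dtype=float)
    gts = np.asarray(gts_per_class, dtype=float)
    # R@K pools all classes together, so frequent (head) predicates dominate.
    recall = hits.sum() / gts.sum()
    # mR@K averages per-class recalls, weighting every class equally.
    valid = gts > 0
    mean_recall = (hits[valid] / gts[valid]).mean()
    return recall, mean_recall

# Toy example: one head class (1000 GTs, 80% recalled) and nine tail classes
# (10 GTs each, 10% recalled) yield R@K ~= 0.74 but mR@K = 0.17.
print(recall_and_mean_recall([800] + [1] * 9, [1000] + [10] * 9))
```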

Predicate Classification
Model R@20 R@50 R@100 mR@20 mR@50 mR@100
IMP+ (Xu et al., 2017; Chen et al., 2019b) 52.7 59.3 61.3 - 9.8 10.5
FREQ (Zellers et al., 2018; Tang et al., 2019) 53.6 60.6 62.2 8.3 13.0 16.0
UVTransE (Hung et al., 2020) - 61.2 64.3 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2019) 58.5 65.2 67.1 10.8 14.0 15.3
KERN (Chen et al., 2019b) - 65.8 67.6 - 17.7 19.2
NODIS (Yuren et al., 2020) 58.9 66.0 67.9 - - -
HCNet (Ren et al., 2021) 59.6 66.4 68.8 - - -
VCTree (Tang et al., 2019) 60.1 66.4 68.1 14.0 17.9 19.4
GPS-Net (Lin et al., 2020) 60.7 66.9 68.8 17.4 21.3 22.8
GB-Net- (Zareian et al., 2020a) - 66.6 68.2 - 22.1 24.0
HOSE-Net (Wei et al., 2020) - 66.7 69.2 - - -
Part-Aware (Tian et al., 2020) 61.8 67.7 69.4 15.2 19.2 20.9
DG-PGNN (Khademi and Schulte, 2020) - 69.0 72.1 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 59.0 65.5 67.2 13.0 16.5 17.8
MOTIFS-Focal (Lin et al., 2017b; Tang et al., 2020) 59.2 65.8 67.7 10.9 13.9 15.0
MOTIFS-Resample (Burnaev et al., 2015; Tang et al., 2020) 57.6 64.6 66.7 14.7 18.5 20.0
MOTIFS-Reweight 45.4 54.7 56.5 14.3 17.3 18.6
MOTIFS-L2+uKD (Wang et al., 2020a) 57.4 64.1 66.0 14.2 18.6 20.3
MOTIFS-L2+cKD (Wang et al., 2020a) 57.7 64.6 66.4 14.4 18.5 20.2
MOTIFS-TDE (Tang et al., 2020) 32.9 45.0 50.6 17.4 24.2 27.9
MOTIFS-PCPL (Yan et al., 2020) 48.4 54.7 56.5 19.3 24.3 26.1
MOTIFS-STL (Chen et al., 2019a) 56.5 65.0 66.9 13.3 20.1 22.3
MOTIFS-DLFE 46.4 52.5 54.2 22.1 26.9 28.8
VCTree (Tang et al., 2019, 2020) 59.8 65.9 67.5 14.1 17.7 19.1
VCTree-Reweight 53.8 60.7 62.6 16.3 19.4 20.4
VCTree-L2+uKD (Wang et al., 2020a) 58.5 65.0 66.7 14.2 18.2 19.9
VCTree-L2+cKD (Wang et al., 2020a) 59.0 65.4 67.1 14.4 18.4 20.0
VCTree-TDE (Tang et al., 2020) 34.4 44.8 49.2 19.2 26.2 29.6
VCTree-PCPL (Yan et al., 2020) 50.5 56.9 58.7 18.7 22.8 24.5
VCTree-STL (Chen et al., 2019a) 57.1 65.2 67.0 14.3 21.4 23.5
VCTree-DLFE 45.7 51.8 53.5 20.8 25.3 27.1
Table 8. Recall and mean recall (with graph constraint) results in the PredCls task on VG150. Models in the first section use the VGG backbone (Simonyan and Zisserman, 2014). Markers have the same meaning as in Table 1 of the main paper.
Scene Graph Classification
Model R@20 R@50 R@100 mR@20 mR@50 mR@100
IMP+ (Xu et al., 2017; Chen et al., 2019b) 31.7 34.6 35.4 - 5.8 6.0
FREQ (Zellers et al., 2018; Tang et al., 2019) 29.3 32.3 32.9 5.1 7.2 8.5
UVTransE (Hung et al., 2020) - 30.9 32.2 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2019) 32.9 35.8 36.5 6.3 7.7 8.2
KERN (Chen et al., 2019b) - 36.7 37.4 - 9.4 10.0
NODIS (Yuren et al., 2020) 36.0 39.8 40.7 - - -
HCNet (Ren et al., 2021) 34.2 36.6 37.3 - - -
VCTree (Tang et al., 2019) 35.2 38.1 38.8 8.2 10.1 10.8
GPS-Net (Lin et al., 2020) 36.1 39.2 40.1 10.0 11.8 12.6
GB-Net- (Zareian et al., 2020a) - 37.3 38.0 - 12.7 13.4
HOSE-Net (Wei et al., 2020) - 36.3 37.4 - - -
Part-Aware (Tian et al., 2020) 36.5 39.4 40.2 8.7 10.9 11.6
DG-PGNN (Khademi and Schulte, 2020) - 39.3 40.1 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 36.4 39.5 40.3 7.2 8.9 9.4
MOTIFS-Focal (Lin et al., 2017b; Tang et al., 2020) 36.0 39.3 40.1 6.3 7.7 8.3
MOTIFS-Resample (Burnaev et al., 2015; Tang et al., 2020) 34.5 37.9 38.8 9.1 11.0 11.8
MOTIFS-Reweight 24.2 29.5 31.5 9.5 11.2 11.7
MOTIFS-L2+uKD (Wang et al., 2020a) 35.1 38.5 39.3 8.6 10.9 11.8
MOTIFS-L2+cKD (Wang et al., 2020a) 35.6 38.9 39.8 8.7 10.7 11.4
MOTIFS-TDE (Tang et al., 2020) 21.4 27.1 29.5 9.9 13.1 14.9
MOTIFS-PCPL (Yan et al., 2020) 31.9 35.3 36.1 9.9 12.0 12.7
MOTIFS-STL (Chen et al., 2019a) 35.4 39.9 40.9 8.5 12.8 14.1
MOTIFS-DLFE 29.0 32.3 33.1 12.8 15.2 15.9
VCTree (Tang et al., 2019, 2020) 42.1 45.8 46.8 9.1 11.3 12.0
VCTree-Reweight 38.0 42.3 43.5 10.6 12.5 13.1
VCTree-L2+uKD (Wang et al., 2020a) 40.9 44.7 45.6 9.9 12.4 13.4
VCTree-L2+cKD (Wang et al., 2020a) 41.4 45.2 46.1 9.7 12.4 13.1
VCTree-TDE (Tang et al., 2020) 21.7 28.8 32.0 11.2 15.2 17.5
VCTree-PCPL (Yan et al., 2020) 36.5 40.6 41.7 12.7 15.2 16.1
VCTree-STL (Chen et al., 2019a) 40.6 45.7 46.9 10.5 14.6 16.6
VCTree-DLFE 29.7 33.5 34.6 15.8 18.9 20.0
Table 9. Recall and mean recall (with graph constraint) results in the SGCls task on VG150. Models in the first section use the VGG backbone (Simonyan and Zisserman, 2014). Markers have the same meaning as in Table 1 of the main paper.
Scene Graph Detection
Model R@20 R@50 R@100 mR@20 mR@50 mR@100
IMP+ (Xu et al., 2017; Chen et al., 2019b) 14.6 20.7 24.5 - 3.8 4.8
FREQ (Zellers et al., 2018; Tang et al., 2019) 20.1 26.2 30.1 4.5 6.1 7.1
UVTransE (Hung et al., 2020) - 25.3 28.5 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2019) 21.4 27.2 30.3 4.2 5.7 6.6
KERN (Chen et al., 2019b) - 27.1 29.8 - 6.4 7.3
NODIS (Yuren et al., 2020) 21.5 27.4 30.7 - - -
HCNet (Ren et al., 2021) 22.6 28.0 31.2 - - -
VCTree (Tang et al., 2019) 22.0 27.9 31.3 5.2 6.9 8.0
GPS-Net (Lin et al., 2020) 22.6 28.4 31.7 6.9 8.7 9.8
GB-Net- (Zareian et al., 2020a) - 26.3 29.9 - 7.1 8.5
HOSE-Net (Wei et al., 2020) - 28.9 33.3 - - -
Part-Aware (Tian et al., 2020) 23.4 29.4 32.7 5.7 7.7 8.8
DG-PGNN (Khademi and Schulte, 2020) - 31.2 32.5 - - -
MOTIFS (Zellers et al., 2018; Tang et al., 2020) 25.8 33.1 37.6 5.3 7.3 8.6
MOTIFS-Focal (Lin et al., 2017b; Tang et al., 2020) 24.7 31.7 36.7 3.9 5.3 6.6
MOTIFS-Resample (Burnaev et al., 2015; Tang et al., 2020) 23.2 30.5 35.4 5.9 8.2 9.7
MOTIFS-Reweight 18.3 24.4 29.3 6.7 9.2 10.9
MOTIFS-L2+uKD (Wang et al., 2020a) 24.8 32.2 36.8 5.7 7.9 9.5
MOTIFS-L2+cKD (Wang et al., 2020a) 25.2 32.5 37.1 5.8 8.1 9.6
MOTIFS-TDE (Tang et al., 2020) 12.4 17.3 20.8 6.7 9.2 11.1
MOTIFS-PCPL (Yan et al., 2020) 21.3 27.8 31.7 8.0 10.7 12.6
MOTIFS-STL (Chen et al., 2019a) 22.5 29.9 34.9 5.4 7.6 9.1
MOTIFS-DLFE 18.9 25.4 29.4 8.6 11.7 13.8
VCTree (Tang et al., 2019, 2020) 24.1 30.8 35.2 5.2 7.1 8.3
VCTree-Reweight 20.8 27.8 32.0 6.6 8.7 10.1
VCTree-L2+uKD (Wang et al., 2020a) 24.4 31.6 35.9 5.7 7.7 9.2
VCTree-L2+cKD (Wang et al., 2020a) 24.8 32.0 36.1 5.7 7.7 9.1
VCTree-TDE (Tang et al., 2020) 12.3 17.3 20.9 6.8 9.5 11.4
VCTree-PCPL (Yan et al., 2020) 20.5 26.6 30.3 8.1 10.8 12.6
VCTree-STL (Chen et al., 2019a) 21.6 28.8 33.6 5.1 7.1 8.4
VCTree-DLFE 16.8 22.7 26.3 8.6 11.8 13.8
Table 10. Recall and mean recall (with graph constraint) results in the SGDet task on VG150. Models in the first section use the VGG backbone (Simonyan and Zisserman, 2014). Markers have the same meaning as in Table 1 of the main paper.

E. Additional Qualitative Results

We present additional results on the change of confidence distribution after applying DLFE to MOTIFS in Figure 10. For table-?-window in example (a), instead of the near predicted by MOTIFS, MOTIFS-DLFE predicts the more descriptive in front of, which matches the ground truth. The same applies to bird-?-pole in example (b), where MOTIFS-DLFE's standing on is better than MOTIFS's on. (c) is an interesting example in that, while the top prediction on by MOTIFS-DLFE is the same as that of MOTIFS, more descriptive predicates such as mounted on and attached to are assigned higher confidence scores by MOTIFS-DLFE. This should make it easier for, e.g., non-graph constraint mean recall (ng-mR@) to recall these fine-grained predicates. Finally, car-?-street is another example in which MOTIFS-DLFE produces the more descriptive parked on rather than on; however, due to the missing labels, parked on is not in the ground truth.
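For readers who wish to reproduce such confidence distributions, the following is a minimal NumPy sketch of the recovery step that DLFE performs at inference time, dividing the biased probabilities by the estimated label frequencies; the renormalization and all function/argument names are our own illustrative assumptions rather than the official implementation.

```python
import numpy as np

def recover_unbiased_probs(biased_probs, label_freqs, renormalize=True):
    """Recover unbiased predicate probabilities from biased ones.

    biased_probs: (N, C) biased probabilities P(s=c|x) from the SGG model.
    label_freqs:  (C,) estimated per-class label frequencies pi_c.
    Dividing by pi_c discounts each class's different chance of being
    labeled; renormalizing each row to sum to one (our assumption, for
    readable confidence plots) gives a proper distribution over predicates.
    """
    biased_probs = np.asarray(biased_probs, dtype=float)
    label_freqs = np.clip(np.asarray(label_freqs, dtype=float), 1e-12, None)
    unbiased = biased_probs / label_freqs
    if renormalize:
        unbiased = unbiased / unbiased.sum(axis=1, keepdims=True)
    return unbiased
```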

Figure 10. Confidence distributions over the predicates, produced by MOTIFS (top of each example) and MOTIFS with DLFE (bottom). Green, red and tangerine predicates denote correct (GT), incorrect (Non-GT and weird) and acceptable (Non-GT but reasonable), respectively.