Scene Graph Prediction with Limited Labels
Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled relationship examples, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method for generating training data outperforms all baseline approaches by 5.16 recall@100. Since we only use a few labels, we define a complexity metric for relationships that serves as an indicator (R^2 = 0.778) for conditions under which our method succeeds over transfer learning, the de-facto approach for training with limited labels.
In an effort to formalize a representation for images, Visual Genome defined scene graphs, a structured formalization that is similar to the form widely used to represent knowledge bases [18, 13, 56]. Scene graphs encode objects (e.g. dog, frisbee) as nodes connected via pairwise relationships (e.g., playing with) as edges. This formalization has led to state-of-the-art models in image captioning [25, 42], visual question answering, relationship modeling and image generation. However, all scene graph models designed so far ignore the relationship categories that do not have sufficient labeled instances (see Figure 2) and instead focus on modeling the few relationships that have thousands of labels [49, 31, 54].
Hiring more human workers is an ineffective solution to labeling relationships because image annotation is so tedious that seemingly obvious labels are left unannotated. To complement human annotators, traditional text-based knowledge completion tasks have explored numerous semi-supervised or distant supervision approaches [7, 6, 17, 34]. These methods find syntactical or lexical patterns from a small labeled set to extract missing relationships from a large unlabeled set. In text, pattern-based methods are successful, as relationships in text are usually document-agnostic (e.g. Tokyo - is capital of - Japan). In vision, images and visual relationships are often incidental: they depend on the contents of the particular image they appear in. Therefore, methods that rely on knowledge from external sources or on patterns over concepts (e.g. most instances of dog next to frisbee are playing with it) do not generalize well for visual relationships. The inability to utilize the progress in text-based methods necessitates specialized methods for visual knowledge.
In this paper, we automatically generate missing relationship labels from a small labeled set and use these automatically generated labels to train downstream scene graph models (see Figure 1). We begin by exploring how to define image-agnostic features for relationships such that they follow patterns across images. For example, eat usually consists of an object eating another object smaller than itself, and for look, the objects in the relationship are usually a phone, laptop, or window (see Figure 3). These rules do not require raw pixel values and can be derived from image-agnostic features like object categories and spatial relations between the objects in a relationship. While these rules are simple, their capacity to find missing relationships has been unexplored. We show that while these image-agnostic features can capture the variance of some visual relationships, they fail to characterize complex relationships with a high variance. So, to quantify the efficacy of our image-agnostic features, we define “subtypes” that measure spatial and categorical variance (Section 3).
Based on our analysis, we propose a semi-supervised approach that takes advantage of image-agnostic features to label missing relationships using as few as 10 labeled instances of each relationship. We learn simple heuristics over these features and assign probabilistic labels to the unlabeled images using a generative model [39, 46]. We test our method’s ability to annotate the already complete VRD dataset and find that it achieves an F1 score of 57.66, higher than other standard semi-supervised methods like label propagation. To demonstrate the utility of our generated labels, we train a state-of-the-art scene graph model (see Figure 6) and modify its loss function to operate over our probabilistic labels. Our approach achieves 47.53 recall@100 for predicate classification on Visual Genome, improving over the same model trained only on the labeled instances by 40.97 points (recall@K is a standard measure for scene graph prediction). Our approach achieves within 8.65 recall@100 for scene graph detection of the same model trained on the fully human-annotated Visual Genome dataset with more labeled data. We end by comparing our approach to transfer learning, the de-facto choice for learning from limited labeled data. We find that our approach performs 5.16 points recall@100 better because it generalizes well to unlabeled subtypes, especially for relationships with high complexity.
Our contributions are three-fold. (1) We introduce the first method to complete visual knowledge bases by finding missing visual relationships (Section 5.1). (2) We show the utility of our generated labels in training existing scene graph prediction models (Section 5.2). (3) We introduce a metric to characterize the variance of visual relationships and show it is a strong indicator (R^2 = 0.778) for our semi-supervised method’s improvements over transfer learning (Section 5.3).
Textual knowledge bases were originally hand-curated by experts to structure facts [5, 44, 4] such as Tokyo - is capital of - Japan. To scale dataset curation, recent approaches either mine knowledge directly from the web  or hire non-expert annotators to manually curate knowledge [47, 5]. Semi-supervised solutions to this problem use a small amount of labeled text to extract patterns and identify relationships in unlabeled sentences [37, 34, 35, 33, 21, 2]. However, these approaches cannot be directly applied to extract visual relationships since textual relations can be captured via external knowledge or lexical patterns while visual relationships are local to an image.
Visual relationships have been studied as spatial priors [14, 16], co-occurrences , language statistics [53, 31, 28], and within entity contexts . Scene graph prediction models have dealt with the difficulty of learning from incomplete knowledge, as recent methods utilize statistical motifs  or object-relationship dependencies [49, 30, 50, 55]. All these methods limit their inference to the top most frequently occurring predicate categories and ignore those without enough labeled examples (Figure 2).
The de-facto solution for problems with few examples is transfer learning [15, 52], which requires that the source domain used for pre-training follows a similar distribution of data as the target domain. In our case, the target domain is a set of limited labeled relationships. The source domain is a dataset of frequently labeled common relationships with thousands of examples [49, 30, 50, 55]. We find that despite similar objects in the source and target domain, transfer learning has difficulty generalizing to new relationships. Instead of relying on transfer learning, we propose a semi-supervised method that can use a small labeled set to annotate a larger unlabeled set of images. Unlike transfer learning, our method does not rely on the availability of a larger, labeled set of relationships.
To address the issue of gathering enough training labels for complex machine learning models, data programming has emerged as a popular paradigm. It models imperfect labeling sources to effectively assign noisy training labels to unlabeled data. Imperfect labeling sources can come from crowdsourcing, user-defined heuristics [8, 43], multi-instance learning [22, 40], and distant supervision [12, 32]. Often, these imperfect labeling sources take advantage of domain expertise from the user. In our approach, we specifically explore how techniques from data programming can be used to aggregate several simple heuristics to assign probabilistic labels to unlabeled data. In our case, the imperfect labeling sources are automatically generated heuristics, which we aggregate to assign a final probabilistic label to every pair of object proposals.
In this section, we define the formal terminology used in the rest of the paper and the image-agnostic features that our semi-supervised method relies on. Then, we seek quantitative insights into how visual relationships can be described by the properties between its objects. In particular, we ask (1) what features can characterize visual relationships while being image-agnostic? and given that we use a few limited labels, (2) how well do our chosen features characterize the variance of relationships? These insights motivate our model design to generate heuristics that do not overfit to the small amount of labeled data and assign accurate labels to the larger, unlabeled set.
A scene graph is a multi-graph that consists of objects as nodes and relationships as edges. Each object consists of a bounding box and a category drawn from the set of all possible object categories, such as dog, person, etc. Relationships are denoted subject - predicate - object, where the predicate is a relation such as ride or eat. We assume that we have a small labeled set of annotated relationships for each predicate. Usually, these labeled sets contain only a few examples. For our semi-supervised approach, we also assume that there exists a large set of images without any labeled relationships.
It has become common in computer vision to utilize pretrained convolutional neural networks to extract features to represent objects and visual relationships [49, 31, 50]. These features, though proven robust in the presence of enough training labels, tend to overfit when presented with limited data (Section 5). Therefore, an open question arises: what other features can we utilize to label relationships with limited data? Previous literature has combined deep learning features with extra information extracted from categorical object labels and relative spatial object locations [25, 31]. We define the categorical features of a candidate relationship as a concatenation of one-hot vectors of the subject and object categories. We define the spatial features of a candidate relationship from the top-left bounding box coordinates of the subject and object and their widths and heights.
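As a rough illustration of the two feature types, the sketch below builds categorical and spatial features for one candidate relationship. The (x, y, w, h) box format, the function names, and the particular spatial terms (offsets normalized by the subject's size, plus size ratios) are assumptions made for this example; they are not the exact feature definition used in the paper.

```python
from typing import List, Sequence, Tuple

# Assumed box format: top-left corner (x, y) plus width and height.
Box = Tuple[float, float, float, float]


def one_hot(category: str, categories: Sequence[str]) -> List[float]:
    """Return a one-hot vector marking `category` within `categories`."""
    return [1.0 if c == category else 0.0 for c in categories]


def categorical_features(subj_cat: str, obj_cat: str,
                         categories: Sequence[str]) -> List[float]:
    """Concatenate the one-hot vectors of the subject and object categories."""
    return one_hot(subj_cat, categories) + one_hot(obj_cat, categories)


def spatial_features(subj: Box, obj: Box) -> List[float]:
    """Illustrative relative-position and relative-size terms (hypothetical choice)."""
    x, y, w, h = subj
    x2, y2, w2, h2 = obj
    return [
        (x - x2) / w,         # horizontal offset, normalized by subject width
        (y - y2) / h,         # vertical offset, normalized by subject height
        w2 / w,               # width ratio
        h2 / h,               # height ratio
        (w2 * h2) / (w * h),  # area ratio
    ]


categories = ["person", "bike", "dog"]
cat_feats = categorical_features("person", "bike", categories)
spat_feats = spatial_features((10.0, 20.0, 40.0, 80.0), (15.0, 60.0, 50.0, 40.0))
```

Neither feature set reads pixel values, which is what makes such features image-agnostic in the sense used above.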
To explore how well spatial and categorical features can describe different visual relationships, we train a simple decision tree model for each relationship. We plot the importances for the top spatial and categorical features in Figure 3. Relationships like fly place high importance on the difference in y-coordinate between the subject and object, capturing a characteristic spatial pattern. The relationship look, on the other hand, depends on the category of the objects (e.g. phone, laptop, window) and not on any spatial orientations.
Since we only have a few ground truth labels, it is important to measure how well the image-agnostic features we defined above can characterize a relationship. To systematically capture the notion of variance of a relationship, we define each relationship to contain a certain number of subtypes. Each subtype captures one way that a relationship manifests in our dataset. For example, in Figure 4, ride contains one categorical subtype with person - ride - bike and another with dog - ride - surfboard. Similarly, a person might carry an object in different relative spatial orientations (e.g. on her head, to her side). As shown in Figure 5, visual relationships have significantly different degrees of spatial and categorical variance, and therefore a different number of subtypes for each. To find all the spatial subtypes, we use mean shift clustering over the spatial features extracted from all the relationships in Visual Genome. To find the categorical subtypes, we count the number of object categories involved with a relationship.
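The subtype counts described above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the flat-kernel mean shift, the bandwidth value, the mode-merging rule, and the reading of a categorical subtype as a distinct (subject category, object category) pair are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_shift_modes(points: Sequence[Point], bandwidth: float,
                     max_iters: int = 100) -> List[Point]:
    """Flat-kernel mean shift: shift every point to the mean of the data
    points within `bandwidth`, repeat until nothing moves, then merge
    shifted points that landed close together into a single mode."""
    current = [tuple(p) for p in points]
    for _ in range(max_iters):
        updated = []
        for p in current:
            near = [q for q in points if math.dist(p, q) <= bandwidth]
            if not near:  # no neighbours left: keep the point where it is
                updated.append(p)
                continue
            updated.append(tuple(sum(c) / len(near) for c in zip(*near)))
        if updated == current:
            break
        current = updated
    modes: List[Point] = []
    for p in current:
        if all(math.dist(p, m) > bandwidth / 2 for m in modes):
            modes.append(p)
    return modes


def count_categorical_subtypes(pairs: Sequence[Tuple[str, str]]) -> int:
    """Count distinct (subject category, object category) pairs."""
    return len(set(pairs))


# Toy spatial features for one predicate: two visibly separate groups.
toy_spatial = [(0.0, -1.0), (0.1, -0.9), (0.0, -1.1), (1.0, 0.0), (1.1, 0.1)]
spatial_subtypes = len(mean_shift_modes(toy_spatial, bandwidth=0.5))

toy_pairs = [("person", "bike"), ("dog", "surfboard"), ("person", "bike")]
categorical_subtypes = count_categorical_subtypes(toy_pairs)
```

In this toy data both counts come out to two. The number of modes found by mean shift depends on the bandwidth, so any complexity score built this way is sensitive to that choice.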
With only a few labeled instances of these visual relationships, it is impossible to capture all the subtypes and therefore to learn a good representation for the relationship as a whole. Instead, we can turn to the rules extracted from image-agnostic features and assign labels to the unlabeled data to try to capture all the subtypes in a visual relationship. We posit that this will be advantageous over methods that only use the small labeled set to train a scene graph prediction model, especially for relationships with high complexity, that is, a large number of subtypes. We find a correlation between our definition of complexity and the performance of our semi-supervised method (Section 5.3).
We aim to automatically annotate missing visual relationships using a few labeled examples, so that the resulting labels can later be used to train any scene graph prediction model. We assume that we have a small labeled set of annotated relationships for each predicate. Usually, these labeled sets contain only a few examples. As discussed in Section 3, we want to use image-agnostic features to learn image-agnostic rules to annotate unlabeled relationships.
Our approach assigns probabilistic labels to a set of un-annotated images in three steps: (1) we extract image-agnostic features from the objects in the labeled set and from the object proposals extracted using an existing object detector on the unlabeled set, (2) we generate heuristics over the image-agnostic features, and finally (3) we use a factor-graph based generative model to aggregate the heuristics and assign probabilistic labels to the unlabeled object pairs. These probabilistic labels, along with the labeled set, are used to train any scene graph prediction model. We describe our approach in Algorithm 1 and show the end-to-end pipeline in Figure 6.
Feature extraction: Our approach uses the image-agnostic features defined in Section 3, which rely on object bounding box and category labels. The features are extracted from ground truth objects in the labeled set or from object detection outputs in the unlabeled set, obtained by running existing object detection models.
Heuristic generation: We fit decision trees over the labeled relationships’ spatial and categorical features to capture image-agnostic rules that define a relationship. These image-agnostic rules are threshold-based conditions that are automatically defined by the decision tree. To prevent overfitting, we limit the complexity of these heuristics by using shallow decision trees with different restrictions on depth over each feature set, resulting in several different decision trees. We then predict labels for the unlabeled set using these heuristics, resulting in a matrix of predictions for the unlabeled relationships.
To further prevent overfitting and to use these heuristics only when they have high confidence about their label, we modify this matrix by converting any predicted label with confidence less than a threshold (chosen empirically, relative to random chance) to an abstain, or no label. The resulting heuristics look like the example in Figure 6, an if statement that only checks if the subject is above the object and then assigns a positive label for the predicate carry.
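To make the idea of a thresholded heuristic with abstention concrete, the sketch below fits a depth-1 decision tree (a stump) on a single feature and lets each leaf vote only when its training purity reaches a confidence threshold. This is a simplified stand-in under stated assumptions: the paper fits shallow decision trees, whereas the stump, the purity-based confidence, the label encoding (+1, -1, 0 for abstain), and the toy data are hypothetical choices for illustration.

```python
from typing import Callable, List, Sequence

ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def fit_stump(xs: Sequence[float], ys: Sequence[int],
              min_conf: float) -> Callable[[float], int]:
    """Fit a one-feature, depth-1 tree. Each leaf predicts its majority
    label, or abstains if the majority fraction is below `min_conf`."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        # Training errors if each side predicts its majority label.
        errors = (min(left.count(POSITIVE), left.count(NEGATIVE))
                  + min(right.count(POSITIVE), right.count(NEGATIVE)))
        if best is None or errors < best[0]:
            best = (errors, t, left, right)
    if best is None:  # no valid split: the heuristic never votes
        return lambda x: ABSTAIN
    _, threshold, left, right = best

    def leaf_vote(labels: List[int]) -> int:
        pos = labels.count(POSITIVE) / len(labels)
        if max(pos, 1.0 - pos) < min_conf:
            return ABSTAIN
        return POSITIVE if pos >= 0.5 else NEGATIVE

    left_vote, right_vote = leaf_vote(left), leaf_vote(right)
    return lambda x: left_vote if x <= threshold else right_vote


# Toy labeled data for "carry": the feature is a normalized vertical
# offset (negative means the subject is above the object).
offsets = [-0.9, -0.8, -0.7, 0.2, 0.4, 0.6]
labels = [1, 1, 1, -1, 1, -1]

strict = fit_stump(offsets, labels, min_conf=0.8)
lenient = fit_stump(offsets, labels, min_conf=0.6)
```

With the stricter threshold, the stump labels pairs whose subject is clearly above the object as carry and abstains elsewhere, because the other leaf is only two-thirds pure. Lowering the threshold makes the same leaf vote negative.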
Generative model: As expected, these heuristics, individually, are noisy and may not assign labels to all object pairs in the unlabeled set. To aggregate the labels from all heuristics, we use a factor graph-based generative model popular in text-based weak supervision techniques [48, 39, 45, 41, 1]. This model learns the accuracies of each heuristic and combines their individual labels to assign probabilistic labels.
The generative model uses a distribution family that relates the latent variable, the true class, to the labels from the heuristics. The distribution includes a partition function that ensures it is normalized. Its parameter encodes the average accuracy of each heuristic and is estimated by maximizing the marginal likelihood of the observed heuristic labels. The generative model assigns probabilistic labels by computing the probability of the true class given the heuristic labels for each object pair in the unlabeled set.
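A small sketch of how accuracy-weighted aggregation differs from majority voting is given below. It assumes the simplest accuracy-only form found in the data programming literature, in which the joint probability of heuristic votes and the true class Y in {-1, +1} is proportional to exp(Y * sum_j phi_j * vote_j); under that assumption the posterior is a logistic function of the weighted vote. The accuracy parameters here are fixed by hand rather than learned by maximizing the marginal likelihood, and the function names are hypothetical.

```python
import math
from typing import Sequence


def probabilistic_label(votes: Sequence[int], phi: Sequence[float]) -> float:
    """P(Y = +1 | votes) under the assumed accuracy-only model.

    votes[j] is +1, -1, or 0 (abstain); phi[j] is heuristic j's accuracy
    weight. With joint proportional to exp(Y * score), the posterior is
    exp(score) / (exp(score) + exp(-score)) = sigmoid(2 * score)."""
    score = sum(w * v for w, v in zip(phi, votes))
    return 1.0 / (1.0 + math.exp(-2.0 * score))


def majority_vote(votes: Sequence[int]) -> float:
    """Unweighted baseline: 1.0, 0.0, or 0.5 on a tie or all-abstain."""
    total = sum(votes)
    if total == 0:
        return 0.5
    return 1.0 if total > 0 else 0.0


# One accurate heuristic votes positive, two weak ones vote negative.
votes = [1, -1, -1]
phi = [2.0, 0.3, 0.3]  # hand-set weights, for illustration only
weighted = probabilistic_label(votes, phi)
unweighted = majority_vote(votes)
```

In this toy case the weighted posterior stays above 0.5 because the single accurate heuristic outweighs the two weak ones, while majority voting returns a negative label. When every heuristic abstains, the posterior is exactly 0.5.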
Training scene graph model: Finally, these probabilistic labels are used to train any scene graph prediction model. While scene graph models are usually trained using a cross-entropy loss [49, 31, 54], we modify this loss function to take into account errors in the training annotations. We adopt a noise-aware empirical risk minimizer that is often seen in logistic regression as our loss function. This loss is defined over the learned logistic regression parameters, the distribution learned by the generative model, the true label, and a set of visual relationship features extracted by any scene graph prediction model.
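One common way to write a noise-aware logistic loss is as the expectation of the usual logistic loss under the probabilistic label. The sketch below assumes that form, with binary labels in {-1, +1} and a linear score; it is an illustration of the idea, not the exact loss used in the paper, and the names `noise_aware_loss`, `theta`, and `probs` are hypothetical.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noise_aware_loss(theta: Sequence[float],
                     features: Sequence[Sequence[float]],
                     probs: Sequence[float]) -> float:
    """Average expected logistic loss.

    probs[i] is the generative model's P(y_i = +1). For each example the
    loss is p * log(1 + exp(-z)) + (1 - p) * log(1 + exp(z)), where z is
    the linear score theta . x_i."""
    total = 0.0
    for x, p in zip(features, probs):
        z = sum(t * xi for t, xi in zip(theta, x))
        total += p * softplus(-z) + (1.0 - p) * softplus(z)
    return total / len(features)


# A confident positive label and an uncertain one (toy values).
toy_features = [[1.0, 2.0], [0.5, -1.0]]
toy_probs = [0.95, 0.55]
loss = noise_aware_loss([0.3, 0.1], toy_features, toy_probs)
```

When a probabilistic label equals 0 or 1, the expression reduces to the standard logistic loss, so hard labels from the small labeled set can be mixed with soft labels from the generative model in the same objective.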
|Ours (Majority Vote)||55.01||57.26||56.11||40.04|
|Ours (Categ. + Spat.)||54.83||60.79||57.66||50.31|
To test our semi-supervised approach to completing visual knowledge bases by annotating missing relationships, we perform a series of experiments and evaluate the model along multiple dimensions. We start by discussing the datasets, baselines, and evaluation metrics used. (1) Our first experiment tests our generative model’s ability to find missing relationships in the completely annotated VRD dataset. (2) Our second experiment demonstrates the utility of our generated labels by using them to train a state-of-the-art scene graph model. We compare our labels to those from the large Visual Genome dataset. (3) Finally, to show that our semi-supervised method performs well even when compared against transfer learning, we focus on a subset of relationships with limited labels and allow the transfer learning model to pretrain on frequent relationships. We demonstrate that our semi-supervised method outperforms transfer learning, even though transfer learning has seen more data. Furthermore, we quantify when our method outperforms transfer learning using our metric for measuring relationship complexity.
|Scene Graph Detection||Scene Graph Classification||Predicate Classification|
|Decision tree ||11.11||12.58||13.23||14.02||14.51||14.57||31.75||33.02||33.35|
|Label propagation ||6.48||6.74||6.83||9.67||9.91||9.97||24.28||25.17||25.41|
|Ours (Categ. + Spat. + Deep)||7.33||7.70||7.79||17.03||17.35||17.39||38.90||39.87||40.02|
|Ours (Majority Vote)||16.86||18.31||18.57||18.96||19.57||19.66||44.18||45.99||46.63|
|Ours (Categ. + Spat.)||17.67||18.69||19.28||20.91||21.34||21.44||45.49||47.04||47.53|
Eliminating synonyms and supersets. Typically, past scene graph models have used predicates from Visual Genome to study visual relationships. Unfortunately, these treat synonyms like laying on and lying on as separate classes. To make matters worse, some predicates are a superset of others; for example, above is a superset of riding. Our method, like the baselines, is unable to learn to differentiate between synonyms and supersets. While we report how well all these methods perform on all predicates in our supplementary material, we eliminate all supersets and merge all synonyms, resulting in a set of unique predicates for the experiments reported in this section. Please see our supplementary section for a list of synonyms and supersets.
Dataset. There are two standard datasets used to evaluate tasks related to visual relationships or scene graphs: VRD and Visual Genome. Each scene graph in either dataset contains objects localized as bounding boxes in the image along with pairwise relationships connecting them, which are categorized as action (e.g., carry), possessive (e.g., wear), spatial (e.g., above), or comparative descriptors (e.g., taller than). Visual Genome is a large visual knowledge base of images. Unfortunately, due to its scale, its scene graph labels are incomplete, making it difficult to measure the precision of our semi-supervised algorithm. VRD, on the other hand, is a smaller but completely annotated dataset. To show the performance of our semi-supervised method in generating training data, we measure performance on the VRD dataset (Section 5.1). Later, to show that the training labels produced can be used to train a large-scale scene graph prediction model, we use Visual Genome (Section 5.2).
To evaluate how well our semi-supervised method annotates missing visual relationships, we measure precision and recall on the VRD dataset’s test set (Section 5.1). To show the utility of the probabilistic labels, we modify an existing scene graph model with our loss function and evaluate it with the three standard evaluation modes for scene graph prediction: (i) scene graph detection (SGDET), which expects only input images and predicts bounding box locations, object categories, and predicate labels, (ii) scene graph classification (SGCLS), which expects ground truth boxes and predicts object categories and predicate labels, and (iii) predicate classification (PREDCLS), which expects a ground truth set of bounding boxes and object categories and predicts predicate labels. We refer the reader to the paper that introduced these three tasks for more details on our evaluation modes. Finally, we explore how relationship complexity, measured using our definition of subtypes, is correlated with our model’s performance relative to transfer learning (Section 5.3).
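For readers unfamiliar with recall@K, the sketch below shows a simplified version of the metric: the fraction of ground truth relationship triples that appear among the K highest-scoring predictions for an image. It matches triples by category names only, which corresponds most closely to a setting where boxes are given; the bounding box overlap check needed for detection settings is omitted, and the data and function name are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def recall_at_k(predictions: List[Tuple[Triple, float]],
                ground_truth: Set[Triple], k: int) -> float:
    """Fraction of ground truth triples found in the top-k predictions,
    where predictions are (triple, confidence) pairs."""
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)[:k]
    top_triples = {triple for triple, _ in ranked}
    return len(top_triples & ground_truth) / len(ground_truth)


toy_predictions = [
    (("person", "ride", "bike"), 0.9),
    (("dog", "ride", "surfboard"), 0.2),
    (("person", "wear", "hat"), 0.6),
]
toy_truth = {("person", "ride", "bike"), ("dog", "ride", "surfboard")}

r_at_2 = recall_at_k(toy_predictions, toy_truth, k=2)
r_at_3 = recall_at_k(toy_predictions, toy_truth, k=3)
```

Because incomplete annotations make it impossible to tell whether an unmatched prediction is wrong or merely unlabeled, recall-style metrics of this kind are preferred over precision for scene graph prediction on sparsely labeled data.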
Baselines. We compare against alternative methods for generating training labels for downstream scene graph models. The Oracle model serves as the upper bound for how well we expect to perform, because it is trained on all of Visual Genome, which contains more labeled relationships than our small labeled set. Next, a Decision Tree model is trained by fitting a single decision tree over the image-agnostic features. It learns from the labeled examples and assigns labels to the unlabeled set. Label Propagation employs a widely used semi-supervised method, label propagation, and considers the distribution of image-agnostic features in the unlabeled set before propagating labels from the labeled set to the unlabeled set.
To ensure that our study is complete, we include a Transfer Learning baseline, which is the de-facto choice for training models with limited data [15, 52]. However, unlike all other methods, transfer learning requires a source dataset for pretraining. We treat the source domain as the remaining frequent relationships in Visual Genome that do not overlap with our chosen relationships. We then fine-tune with the limited labeled examples for the predicates in the labeled set. Our transfer learning baseline has an unfair advantage because of the overlap in objects between its source and target relationship sets. Our experiments will show that even with this advantage, our method performs better.
Ablations. We ablate the image-agnostic features with deep learning features and consider the efficacy of our generative model. (Categ.) uses only categorical features, (Spat.) uses only spatial features, (Deep) uses only deep learning features extracted using ResNet50 from the union of the object pair’s bounding boxes, (Categ. + Spat.) uses categorical features concatenated with spatial features, and (Categ. + Spat. + Deep) combines all three. (Majority Vote) uses the categorical and spatial features but replaces our generative model with a simple majority voting scheme to aggregate heuristic function outputs.
We evaluate our performance in annotating missing relationships in . Before we use these labels to train scene graph prediction models, we report results comparing our method to baselines in Table 1. On the fully annotated VRD dataset , Ours (Categ. + Spat.) achieves recall given only labeled examples, which is , , and points better than Label Propagation, Decision Tree and Majority Vote, respectively.
Qualitative error analysis. We visualize labels assigned by Ours in Figure 7 and find that they correspond to image-agnostic rules explored in Figure 3. In Figure 7(a), Ours predicts fly because it learns that fly typically involves objects that have a large difference in y-coordinate. In Figure 7(b), we correctly label look because phone is an important categorical feature. In some difficult cases, our semi-supervised model fails to generalize beyond the image-agnostic features. In Figure 7(c), we mislabel hang as sit by incorrectly relying on the categorical feature chair, which is one of sit’s important features. In Figure 7(d), our model learns that ride typically occurs directly above another object that is slightly larger, and so assumes book - ride - shelf instead of book - sitting on - shelf. In Figure 7(e), our model reasonably classifies glasses - cover - face. However, sit exhibits the same semantic meaning as cover in this context, and our model makes an incorrect classification as a result of dataset bias.
We compare our method’s labels to those generated by the baselines described earlier by using them to train models for three scene graph tasks and report results in Table 2. We improve over all baselines, including our primary baseline, Transfer Learning, by 5.16 recall@100 for PREDCLS. We also achieve within 8.65 recall@100 of Oracle for SGDET. We generate higher quality training labels than Decision Tree and Label Propagation, leading to higher recall@100 for PREDCLS.
Effect of labeled and unlabeled data. In Figure 8 (left two graphs), we visualize how scene graph and predicate classification performance varies as we reduce the number of labeled examples from to . We observe greater advantages over Transfer Learning as decreases, with an increase of recall@100 PREDCLS when . This result matches our observations from Section 3 because a larger set of labeled examples gives Transfer Learning information about a larger proportion of subtypes for each relationship. In Figure 8 (right two graphs), we visualize our method’s performance as the number of unlabeled data points increase, finding that we approached Oracle performance as the number of unlabeled images increases.
Ablations. Ours (Categ. + Spat. + Deep.) hurts performance by up to recall@100 for PREDCLS because it overfits to image features while Ours (Categ. + Spat.) performs the best. We show improvements of recall@100 for SGDET over Ours (MajorityVote), indicating that the generated heuristics indeed have different accuracies and should be weighted differently.
Inspired by the recent work comparing transfer learning and semi-supervised learning, we sought to determine when our method is preferred over transfer learning. By defining a variance metric based on spatial and categorical subtypes of each predicate (Section 3), we show this trend in Figure 9. We find that when the predicate in question has a high variance (as measured by a large number of subtypes), Ours (Categ. + Spat.) outperforms Transfer Learning (Figure 9, left), with correlation coefficient .
In addition, we compare how the number of subtypes in the unlabeled set () affects the performance of our model (Figure 9, center). We find a strong correlation (), which can be explained by the fact that as our method assigns labels to an increasing number of subtypes, the scene graph prediction model has access to data with higher variance to learn from.
We also compare the difference in performance to the proportion of subtypes captured in the labeled set (Figure 9, right). As we hypothesized earlier, Transfer Learning suffers in cases when the labeled set only captures a small portion of the relationship’s subtypes. This trend () explains how Ours (Categ. + Spat.) performs better when given a small portion of labeled subtypes.
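The strength of trends like these is summarized by a coefficient of determination. As a reminder of what that number measures, the sketch below computes R^2 for a simple linear fit, which equals the squared Pearson correlation between the two variables. The inputs are made-up toy values, not results from the paper, and the function name is hypothetical.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of a simple linear regression of ys on xs,
    computed as the squared Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Toy example: x could stand for a predicate's number of subtypes and
# y for the gain over a baseline. These values are purely illustrative.
toy_subtypes = [1.0, 2.0, 3.0, 4.0]
perfectly_linear = r_squared(toy_subtypes, [2.0, 4.0, 6.0, 8.0])
uncorrelated = r_squared(toy_subtypes, [1.0, -1.0, -1.0, 1.0])
```

A value near 1 means the complexity measure explains most of the variation in the performance gap across predicates, while a value near 0 means it explains almost none.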
We introduce the first method that completes visual knowledge bases like Visual Genome by finding missing visual relationships. We define categorical and spatial features as image-agnostic features and introduce a factor-graph based generative model that uses these features to assign probabilistic labels to unlabeled images. Our method outperforms baselines in precision and recall when finding missing relationships in the complete VRD dataset. Our labels can also be used to train scene graph prediction models with minor modifications to their loss function to accept probabilistic labels. We outperform transfer learning and other baselines and come close to oracle performance of the same model trained on more labeled data. Finally, we introduce a metric to characterize the variance of visual relationships and show it is a strong indicator of how our semi-supervised method performs compared to baselines.
We thank Shyamal Buch, Ines Chami, Iro Armeni, Megan Leszczynski, and Jian Zhang for their helpful comments and suggestions. This work was partially funded by the Brown Institute of Media Innovation and the Toyota Research Institute (“TRI”) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This research is also sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement number FA8750-17-2-0095. We gratefully acknowledge the support of the DARPA SIMPLEX program under No. N66001-15-C-4043, DARPA FA8750-12-2-0335 and FA8750-13-2-0039, DOE 108845, the Moore Foundation, National Institute of Health (NIH) U54EB020405, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the National Science Foundation (NSF) Graduate Research Fellowship under No. DGE-114747, Joseph W. and Hon Mai Goodman Stanford Graduate Fellowship, the Moore Foundation, the Okawa Research Grant, American Family Insurance, Accenture, Toshiba, and Intel. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project: Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP, VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, NIH, ONR, or the U.S. government.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3298–3308. IEEE, 2017.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 397–406, 2014.
Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4408–4417. IEEE, 2017.