Generative Graph Perturbations for Scene Graph Prediction

by Boris Knyazev et al., July 11, 2020

Inferring objects and their relationships from an image is useful in many applications at the intersection of vision and language. Due to a long-tailed data distribution, the task is challenging, with the inevitable appearance of zero-shot compositions of objects and relationships at test time. Current models often fail to properly understand a scene in such cases, since during training they only observe a tiny fraction of the distribution, corresponding to the most frequent compositions. This motivates us to study whether increasing the diversity of the training distribution, by generating replacements for parts of real scene graphs, can lead to better generalization. We employ generative adversarial networks (GANs) conditioned on scene graphs to generate augmented visual features. To increase their diversity, we propose several strategies to perturb the conditioning. One of them is to use a language model, such as BERT, to synthesize plausible yet still unlikely scene graphs. By evaluating our model on Visual Genome, we obtain both positive and negative results. This prompts us to make several observations that can potentially lead to further improvements.


1 Introduction and Related Work

Reasoning about the world in terms of objects and the relationships between them is an important aspect of human and machine cognition. It enables generalization to novel compositions of concepts (Atzmon et al., 2016; Johnson et al., 2017; Bahdanau et al., 2018; Keysers et al., 2019; Lake, 2019) and supports the interpretability of machine vision and reasoning (Norcliffe-Brown et al., 2018; Hudson & Manning, 2019a). When learning to solve image understanding tasks, a model may be exposed to compositions such as “person riding a horse” and “person next to a bike”. Then, at test time, to accurately recognize a novel composition “person riding a bike”, the model needs to understand the concepts of ‘person’, ‘bike’, ‘horse’ and ‘riding’. A model might appear to recognize such novel compositions merely by capturing and inappropriately exploiting statistical correlations in the data, e.g. if ‘riding’ has always occurred in an outdoor landscape or people have always appeared wearing helmets when riding. Unsurprisingly, more careful compositional generalization tests have shown that such models can fail remarkably (Atzmon et al., 2016; Lu et al., 2016; Tang et al., 2020; Knyazev et al., 2020), e.g. recall can drop from 30% to 4.5% (Figure 1).

Figure 1: (left) The triplet distribution in Visual Genome is extremely long-tailed, with numerous zero-shot triplets (i.e. observed only at test time). (right) The training set contains a tiny fraction of all possible triplets. We argue that a large fraction of the triplets missing from the dataset are quite plausible compositions. We aim to “hallucinate” them using a GAN to increase the diversity of training samples and improve generalization. The recall results (R@100) from the recent work of Tang et al. (2020) highlight a severe drop in recall on zero-shot triplets, making their appearance at test time problematic.

One of the main approaches to tackling this problem is to explicitly introduce an inductive bias of compositionality in the form of translation operators (Zhang et al., 2017), decoupling object and predicate features (Yang et al., 2018) or constructing causal graphs (Tang et al., 2020).

However, another possible approach, still underexplored in compositional image understanding, is exposing the model to a large diversity of training examples, which can lead to emergent generalization (Hill et al., 2019; Ravuri & Vinyals, 2019). To avoid the expensive labeling of additional data required to increase this diversity, we consider a generative approach, in particular generative adversarial networks (GANs) (Goodfellow et al., 2014). Recently, GANs have been significantly improved w.r.t. training stability and the quality of generated samples (see BigGAN (Brock et al., 2018)), but their usage for data augmentation still remains limited (Ravuri & Vinyals, 2019). One way to address this limitation is to learn the factors of variation in the data (Chen et al., 2016), such that out-of-distribution (OOD) samples useful for augmentation can be created by conditioning on unseen combinations of factors. Indeed, recent work has shown that it is possible to produce plausible OOD examples conditioned on unseen label combinations by intervening on the underlying graph (Kocaoglu et al., 2017). In this work, we have direct access to the underlying graphs of images in the form of scene graphs. By perturbing these graphs in a controlled way, we propose to create OOD compositions, so that a GAN conditioned on them is encouraged to generate diverse OOD samples.

The closest work that also considers a generative approach for visual relationship detection is Wang et al. (2019). Compared to their work, where they condition on triplets (i.e. <subject, predicate, object>, such as <person, riding, horse>), we condition a GAN on scene graphs, which combinatorially increases the number of possible augmentations. Conditioning on a whole scene graph is also beneficial for avoiding totally implausible scenarios: by randomly perturbing only part of the scene graph, we maintain its overall likelihood. Next, we condition on object and predicate categories, rather than on visual features of training images. This avoids the issue of mismatched features (coming from different contexts), which can degrade the quality of generated features. Our model also follows a standard scene graph classification pipeline (Xu et al., 2017; Zellers et al., 2018), including object and predicate classification instead of classifying only the predicate, which enables a more comprehensive study of compositional generalization (Bahdanau et al., 2018). Finally, we train the generative and classification models jointly, end-to-end (Figure 3).

Generating images conditioned on a structured input, such as a scene graph or a bounding box layout, has been explored in several recent works, summarized in Table 1. In this work, we rely on Johnson et al. (2018) to generate visual features given a scene graph.

Model | Boxes layout | Relationships
Sg2Im (Johnson et al., 2018) | Determ. given or inferred from 𝒢 | Inferred from 𝒢
InterSg2Im (Mittal et al., 2019) | Determ. given | Spatial only
SOARISG (Ashual & Wolf, 2019) | Provided by user | Spatial only
OC-GAN (Sylvain et al., 2020) | Provided by user | Spatial only
Layout2Im (Zhao et al., 2019) | Provided by user | Not supported
LostGAN (Sun & Wu, 2019) | Provided by user | Not supported
SPADE (Park et al., 2019) | Masks provided by user | Not supported
Caption2Im (Hong et al., 2018) | Determ. given | Inferred from C
Ours | Ground truth given | Inferred from 𝒢
Table 1: Summary of generative models conditioned on objects (and relationships between them). 𝒢 denotes a scene graph, C a caption. Most of these settings are only partially appropriate for our work, because we would like to increase the diversity of samples by generating the layout stochastically given 𝒢.

Other works focus on improving the quality of generated images by avoiding generating the layout or by avoiding conditioning on a full spectrum of relationships. We show that using a layout is essential for our task, and we leverage the ground-truth layout instead of generating it, because (1) the layout generated by the model of Johnson et al. (2018) was often implausible in the case of OOD conditioning (Figure 2); and (2) in our experience, joint training of layout generation with the rest of the model was unstable. The limitations of our approach are discussed in Section 3.

[Figure 2 panels: cup on table; person riding elephant; cup under table; clock under table; giraffe riding elephant; giraffe wears shirt]
Figure 2: Examples of in-distribution and out-of-distribution (bolded) layouts generated by a pretrained model from Johnson et al. (2018). The first object is in blue and the second object is in red. We can observe that the model struggles to generate plausible layouts for certain OOD conditioning (denoted by ✗).

2 Model

We use a conditional generative adversarial network (CGAN) (Mirza & Osindero, 2014) (Figure 3). We also considered ACGAN (Odena et al., 2017), but found its performance to be significantly inferior to CGAN on our task.

Figure 3: Our generative scene graph augmentation pipeline with its main components: discriminators D_obj, D_rel and D_glob, a generator G, and a scene graph classification model F. The generative components are detailed in Figure 4. The conditioning class of objects and predicates is passed to the discriminators as in CGAN (Mirza & Osindero, 2014), which is not shown to avoid clutter. See §2 for notation.

2.1 Scene Graph Classification Model

During training, we are given tuples of an image I, the corresponding ground truth scene graph 𝒢 consisting of objects O and relationships R between them, as well as bounding boxes B for all objects. Following Xu et al. (2017), we use a pretrained object detector (Ren et al., 2015) to extract global visual feature maps V and, given B, visual features of nodes and edges, h_obj and h_rel, respectively.

Figure 4: Our GAN model in more detail. We use the same graph convolutional network (GCN) and feature interpolation (gray blocks) as in Johnson et al. (2018), which we refer to for additional details. Our GCN has five fully-connected layers with 64 hidden units each. We refine coarse feature maps with five convolutional layers, gradually increasing the number of channels from 64 to 512 to match the dimensionality of the real feature maps V. Finally, we extract node and edge features using RoIAlign (He et al., 2017). Following Radford et al. (2015), all discriminators are regularized by having a fully-convolutional architecture with four layers interleaved with ReLU.

Figure 5: Example of BERT-style perturbations with and without context. We show that BERT returns tokens appropriate to the context, which can help to keep perturbed scene graphs plausible. A scene graph is perturbed recursively until a specified ratio (α) of perturbed nodes and edges is reached. The scene graph is taken from the original Visual Genome (Krishna et al., 2017) for visualization purposes.

Global feature maps V are extracted from one of the last convolutional layers of the detector, while h_obj and h_rel are extracted from V using RoIAlign (He et al., 2017), given the bounding boxes B and their unions (for edges), respectively. We follow Xu et al. (2017) and do not update the detector during training, and only consider the tasks that assume ground truth bounding boxes are available instead of proposals. Given the visual features (h_obj, h_rel), the message passing (MP) model F of Xu et al. (2017) is trained to predict a scene graph 𝒢, i.e. we need to correctly assign object labels to node features h_obj and predicate classes to edge features h_rel. This completes our baseline model.
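To make this step concrete, the sketch below extracts node features from box regions and edge features from the unions of subject and object boxes using torchvision's RoIAlign. The tensors, box coordinates, output resolution and spatial scale are placeholder assumptions for illustration, not the exact detector configuration used here.

```python
import torch
from torchvision.ops import roi_align

def union_boxes(boxes, edges):
    """Union of subject/object boxes (x1, y1, x2, y2) for each edge (i, j)."""
    subj, obj = boxes[edges[:, 0]], boxes[edges[:, 1]]
    return torch.stack([
        torch.minimum(subj[:, 0], obj[:, 0]),
        torch.minimum(subj[:, 1], obj[:, 1]),
        torch.maximum(subj[:, 2], obj[:, 2]),
        torch.maximum(subj[:, 3], obj[:, 3]),
    ], dim=1)

# Placeholder inputs: a 512-channel feature map V, 4 node boxes, 3 edges.
V = torch.randn(1, 512, 38, 38)                 # global feature map from the detector
boxes = torch.tensor([[0, 0, 100, 100], [50, 50, 200, 200],
                      [10, 120, 90, 220], [150, 10, 280, 120]], dtype=torch.float)
edges = torch.tensor([[0, 1], [1, 2], [0, 3]])  # (subject, object) index pairs

scale = 38 / 600.0  # feature-map resolution relative to a 600x600 input (assumed)
h_obj = roi_align(V, [boxes], output_size=7, spatial_scale=scale)                        # [4, 512, 7, 7]
h_rel = roi_align(V, [union_boxes(boxes, edges)], output_size=7, spatial_scale=scale)    # [3, 512, 7, 7]
```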

2.2 Generative Model

To augment the baseline model, we generate novel visual features conditioned on a scene graph 𝒢' obtained by perturbing a ground truth 𝒢. Analogous to the baseline model, given the augmented features we train F to predict 𝒢', so that such augmentation can improve the performance of F at test time. Nodes and edges in 𝒢' are represented as GloVe word embeddings (Pennington et al., 2014). A noise vector z is injected at each node and edge in 𝒢' to make sure different features are generated for the same 𝒢'. The generator G is implemented as a graph convolutional network, followed by layout construction and feature refinement (Johnson et al., 2018), which we explain in Figure 4. We have independent discriminators for nodes and edges, D_obj and D_rel, that discriminate real features (h_obj, h_rel) from fake ones (h'_obj, h'_rel) conditioned on their class as per the CGAN. We also add a global discriminator D_glob acting on the feature maps V and V', which encourages global consistency between nodes and edges. Thus, D_obj and D_rel are trained to match marginal distributions, while D_glob is trained to match the joint distribution. The intuition is that the right balance between these discriminators should enable the generation of realistic visual features conditioned on OOD scene graphs.
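As a rough illustration of this design, the sketch below implements a class-conditional feature discriminator and an unconditional global discriminator. Conditioning here is done by concatenating a learned class embedding with a pooled feature vector, which is one common CGAN choice; the discriminators in this paper are fully convolutional (Figure 4), and all layer sizes and class counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CondFeatureDiscriminator(nn.Module):
    """Discriminates real vs. generated node/edge features, conditioned on the class
    label by concatenating a learned class embedding (one common CGAN variant)."""
    def __init__(self, feat_dim=512, num_classes=150, emb_dim=64):
        super().__init__()
        self.emb = nn.Embedding(num_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))  # single real/fake logit

    def forward(self, h, y):
        return self.net(torch.cat([h, self.emb(y)], dim=1))

class GlobalDiscriminator(nn.Module):
    """Unconditional discriminator on whole feature maps (global consistency)."""
    def __init__(self, channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 3, stride=2, padding=1))

    def forward(self, V):
        return self.net(V).mean(dim=(1, 2, 3))  # one logit per feature map

# Illustrative usage with pooled 512-d node features for 4 objects.
D_obj = CondFeatureDiscriminator()
h_pooled, y = torch.randn(4, 512), torch.randint(0, 150, (4,))
logits = D_obj(h_pooled, y)
```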

2.3 Scene Graph Perturbations

We experiment with two perturbation strategies: random and BERT-based (Devlin et al., 2018). In both cases, for a scene graph with N nodes and M edges, we randomly choose αN nodes and αM edges for which we change the category, where α is the intensity of perturbations and α = 0 corresponds to no perturbations (𝒢' = 𝒢). For random perturbations, we sample categories of nodes and edges from a uniform distribution. For BERT, we create a textual query from a triplet, in which we mask out one of the nodes or the edge. A pretrained BERT model then returns a list of tokens plausible for the masked entity, ranked by their likelihood score (Figure 5). We also explore contextual BERT perturbations, where each perturbation is conditioned on the current scene graph. This is achieved by simply appending all triplets other than the perturbed one to the BERT query. For BERT-based perturbations, we introduce a tuned threshold defining a lower bound on the BERT score for a given token (a higher threshold keeps only tokens that are more likely according to BERT).
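The snippet below sketches this BERT-based perturbation with the Hugging Face fill-mask pipeline. The checkpoint name, the phrasing of the query, and the probability threshold are assumptions for illustration; in particular, the threshold here operates on the pipeline's normalized scores, whereas the threshold described above is defined on BERT's score scale.

```python
from transformers import pipeline

# Pretrained masked language model (assumed checkpoint: bert-base-uncased).
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def perturb_triplet(subj, pred, obj, mask="object", context="", min_score=0.01):
    """Mask one entity of a <subject, predicate, object> triplet and let BERT propose
    replacements; `context` can hold the remaining triplets for contextual perturbations."""
    parts = {"subject": subj, "predicate": pred, "object": obj}
    parts[mask] = unmasker.tokenizer.mask_token
    query = (context + ". " if context else "") + \
            f"{parts['subject']} {parts['predicate']} {parts['object']}"
    candidates = unmasker(query)
    # Keep tokens above a likelihood threshold (analogous to the tuned BERT threshold).
    return [c["token_str"] for c in candidates if c["score"] >= min_score]

print(perturb_triplet("person", "riding", "horse", mask="object",
                      context="person wearing helmet"))
```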

We have also noticed that the original scene graphs can be very sparse, with many isolated nodes (Figure 8), which can hurt feature propagation in both G and F. Therefore, in both the random and BERT-based cases, we further augment scene graphs by adding new edges, following the same steps described above but only for edges. We found that adding new edges works well in practice; we leave the addition of new nodes for future work.
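A minimal sketch of the random perturbation and edge-densification steps, operating on integer category labels; the category counts, intensity and number of added edges below are illustrative defaults, not the settings used in the experiments.

```python
import random

def perturb_graph(obj_labels, edges, pred_labels, num_obj_cls=150, num_pred_cls=50,
                  alpha=0.2, add_edges=2):
    """Randomly re-label a fraction `alpha` of nodes/edges and connect a few currently
    unlinked node pairs with random predicates. `edges` is a list of (i, j) tuples."""
    obj_labels, pred_labels, edges = list(obj_labels), list(pred_labels), list(edges)
    for i in random.sample(range(len(obj_labels)), k=int(alpha * len(obj_labels))):
        obj_labels[i] = random.randrange(num_obj_cls)
    for i in random.sample(range(len(pred_labels)), k=int(alpha * len(pred_labels))):
        pred_labels[i] = random.randrange(num_pred_cls)
    # Densify: add edges between node pairs that are not yet connected.
    linked = set(edges)
    candidates = [(i, j) for i in range(len(obj_labels)) for j in range(len(obj_labels))
                  if i != j and (i, j) not in linked]
    for (i, j) in random.sample(candidates, k=min(add_edges, len(candidates))):
        edges.append((i, j))
        pred_labels.append(random.randrange(num_pred_cls))
    return obj_labels, edges, pred_labels
```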

2.4 Objective

Our objective is a sum of several losses: the standard loss (1) used in scene graph classification and losses specific to our generative pipeline (2)-(5).

The scene graph classification loss is defined as:

L_cls = ℓ(F(h_obj, h_rel), 𝒢),    (1)

where ℓ is a scene graph classification loss, typically implemented as cross-entropy for nodes and edges (Zellers et al., 2018). We use its improved version, the graph density-normalized loss (Knyazev et al., 2020). Using this loss alone corresponds to the baseline model.

The reconstruction loss is defined as:

L_rec = ℓ(F(h'_obj, h'_rel), 𝒢'),   where (h'_obj, h'_rel, V') = G(𝒢', z),    (2)

and 𝒢' is some perturbation of the ground truth graph 𝒢. The purpose of this loss is to improve the scene graph classification model F, which is our main goal, and at the same time to improve the generator G, which can then generate better features to further improve F.

We also consider conditional adversarial losses. For convenience, we first write them in a general form, separately for the discriminator and the generator. For some real features h, generated features h' and their corresponding class y:

L_D = −E[log D(h | y)] − E[log(1 − D(h' | y))],    (3)

L_G = −E[log D(h' | y)].    (4)

We compute these losses for object and edge visual features using the discriminators D_obj and D_rel. The adversarial loss is also computed for the global feature maps V and V' using D_glob, so that the total discriminator and generator losses are:

L_D^total = L_D_obj + L_D_rel + L_D_glob,    L_G^total = L_G_obj + L_G_rel + L_G_glob,    (5)

where the global terms are computed without class conditioning, i.e. our global discriminator is unconditional for simplicity, which can be addressed in future work. Thus, the total loss that we aim to minimize is:

L = L_cls + λ_rec · L_rec + λ_adv · (L_G^total + L_D^total),    (6)

where λ_rec and λ_adv are loss weights set empirically in our experiments.
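For concreteness, here is a sketch of how the terms above can be combined, assuming the standard binary cross-entropy form of the adversarial losses in (3)-(4); the weighting scheme and default lambda values are placeholders rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

def d_loss(d_real_logits, d_fake_logits):
    """Eq. (3)-style discriminator loss in binary cross-entropy form."""
    return (F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits)) +
            F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits)))

def g_loss(d_fake_logits):
    """Eq. (4)-style generator loss: fool the discriminator on generated features."""
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

def total_loss(cls_loss, rec_loss, g_losses, lambda_rec=1.0, lambda_adv=1.0):
    """Eq. (6)-style objective for the generator/classifier update; the discriminators
    are updated separately with `d_loss`. The lambda values are placeholders."""
    return cls_loss + lambda_rec * rec_loss + lambda_adv * sum(g_losses)

# Illustrative usage with dummy logits and losses.
fake_logits = torch.randn(4, 1)
loss = total_loss(cls_loss=torch.tensor(1.2), rec_loss=torch.tensor(0.8),
                  g_losses=[g_loss(fake_logits)])
```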

Model   | Perturb  | Scene Graph Classification   | Predicate Classification
        |          | R@100 | zsR@100 | wR@20        | R@50 | zsR@50 | wR@5
MP      | –        | 48.6  | 9.1     | 28.2         | 78.2 | 28.4   | 58.4
MP+GAN  | –        | 49.4  | 9.2     | 28.1         | 78.4 | 28.7   | 59.1
MP+GAN  | Rand     | 49.6  | 9.6     | 27.9         | 78.5 | 28.8   | 58.9
MP+GAN  | BERT     | 49.1  | 9.4     | 28.1         | 78.4 | 28.2   | 58.5
MP+GAN  | BERT+Ctx | 49.3  | 9.5     | 28.1         | 78.5 | 29.3   | 58.9
Table 2: Results on Visual Genome (Krishna et al., 2017) using the baseline Message Passing (MP) model and the dataset split of Xu et al. (2017), compared to our MP+GAN model with random, BERT-based and contextual BERT perturbations. We report the recall metric commonly used in these tasks (R@K), recall on zero-shot triplets (zsR@K) and weighted recall (wR@K) (Knyazev et al., 2020). Higher is better.

3 Experiments

Figure 6: Ablating different components of the pipeline. Here, to analyze generalization using a single score, we compute a weighted average of two metrics (recall and zero-shot recall). Higher is better.

For the baseline, we use Message Passing (MP) (Xu et al., 2017), which has strong compositional generalization capabilities (Knyazev et al., 2020). We use a publicly available implementation of MP (https://github.com/rowanz/neural-motifs) with its default hyperparameters for both the baseline MP and MP with a GAN. We evaluate the models on the standard split of Visual Genome (Krishna et al., 2017), with the 150 most frequent object classes and 50 predicate classes, introduced in Xu et al. (2017).

To train the GAN, we generally follow the hyperparameters suggested by SPADE (Park et al., 2019). In particular, we use Spectral Norm (Miyato et al., 2018) for the discriminators, Batch Norm (Ioffe & Szegedy, 2015) for the generator, and TTUR (Heusel et al., 2017) with learning rates of 1e-4 and 2e-4 for the generator and discriminators, respectively.
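The sketch below shows this training setup in PyTorch: Spectral Norm on the discriminator's convolutional layers, Batch Norm in the generator, and TTUR via separate Adam optimizers with the two learning rates mentioned above. The tiny networks are placeholders standing in for the actual generator and discriminators.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Placeholder generator head with Batch Norm (Ioffe & Szegedy, 2015).
generator = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 512, 3, padding=1))

# Placeholder discriminator with Spectral Norm (Miyato et al., 2018) on each conv layer.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(512, 128, 3, stride=2, padding=1)), nn.ReLU(),
    spectral_norm(nn.Conv2d(128, 1, 3, stride=2, padding=1)))

# TTUR (Heusel et al., 2017): a smaller learning rate for the generator than the discriminator.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
```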

Following prior work (Xu et al., 2017; Zellers et al., 2018), we evaluate on the scene graph classification (SGCls) and predicate classification (PredCls) tasks using recall metrics (Table 2). Adding our generative method yields noticeable gains on some metrics. Importantly, we improve recall on zero-shot triplets (zsR), which better measures compositional generalization. This suggests that increasing the diversity of training samples is beneficial for generalization.
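For reference, here is a simplified sketch of the recall@K metric used above: the fraction of ground-truth triplets recovered among the model's K highest-scoring predicted triplets (zero-shot recall applies the same computation restricted to ground-truth triplets unseen during training). The exact matching protocol of the benchmark involves additional constraints not shown here.

```python
def recall_at_k(pred_triplets, scores, gt_triplets, k=100):
    """pred_triplets: list of (subject, predicate, object) label tuples with confidence
    `scores`; gt_triplets: set of ground-truth triplets for the same image."""
    ranked = sorted(zip(scores, pred_triplets), key=lambda p: p[0], reverse=True)
    top_k = {t for _, t in ranked[:k]}
    return sum(t in top_k for t in gt_triplets) / max(1, len(gt_triplets))

preds = [("person", "riding", "horse"), ("person", "wearing", "helmet"), ("dog", "on", "grass")]
scores = [0.9, 0.7, 0.2]
gt = {("person", "riding", "horse"), ("dog", "on", "grass")}
print(recall_at_k(preds, scores, gt, k=2))  # 0.5
```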

Figure 7: (left) Varying the perturbation intensity α for random perturbations; (middle) varying α and the BERT threshold for BERT perturbations without context and (right) with it; darker is better.

We perform a number of ablations to determine the effect of the proposed losses and perturbation strategies (Figure 6). The largest drops in performance are observed if we turn off any of the losses in (6), if we generate node and edge features directly with a GCN without constructing a layout, or if we attempt to generate the layout following Johnson et al. (2018) (with a bounding box loss) instead of using the ground truth one. While generating the layout is desirable for our pipeline, since it allows us to match the layout to the conditioning scene graph, we found it challenging to train layout generation jointly with the rest of the model. This can also explain why our perturbations help only marginally, and only when few nodes/edges are perturbed (Figure 7).

BERT perturbations are shown to yield some improvement in the zero-shot predicate classification task. However, overall gains are less than expected. We hypothesize this is due to scene graphs being very sparse and the vocabulary being very limited in this split of Visual Genome, so that it is hard to leverage the full potential of BERT (see Figure 8). This hypothesis should be verified on a more diverse dataset, such as the original Visual Genome (Krishna et al., 2017) or GQA (Hudson & Manning, 2019b).

4 Conclusion

Recognizing objects and the relationships between them in the form of scene graphs is a challenging task, due to an extremely long-tailed distribution and a strong bias towards frequent compositions. To tackle this problem, we consider a generative approach that increases the diversity of training samples by conditioning a GAN on out-of-distribution compositions. We explore random and language-based strategies to create novel compositions and show improvements in certain cases. We also highlight the limitations of this work; addressing them could further improve the results.

[Figure 8 columns: Image | Ground truth | Rand | BERT | BERT | BERT+Ctx | BERT+Ctx, at varying perturbation intensity α and BERT threshold]
Figure 8: Examples of perturbations applied to test scene graphs. These examples are not picked in any particular way. We show random perturbations (Rand), and BERT-based perturbations without and with context (BERT, BERT+Ctx), each at two settings of the perturbation intensity α and BERT threshold (7 and 9). Red edges denote perturbed or new edges; nodes circled in red denote perturbed nodes. To show diversity, the bottom row illustrates perturbations for the same graph as in the row second from the bottom. Even though BERT-based perturbations are more plausible than random ones in some cases, they are still of poor quality, which requires further investigation. One reason might be that the ground truth scene graphs in this split of Visual Genome are too sparse and their vocabulary too limited, so they do not provide enough language context for the BERT model. To alleviate this issue, we would like to explore the original scene graphs of Visual Genome or other datasets with more diverse and rich scene graphs. Here, graphs are visualized automatically, which results in some visual artifacts. Intended to be viewed on a computer display.

Acknowledgments

BK is funded by the Mila internship, the Vector Institute and the University of Guelph. CC is funded by DREAM CDT. EB is funded by IVADO. This research was developed with funding from DARPA. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation. We are also thankful to Brendan Duke for the help with setting up the compute environment. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute: http://www.vectorinstitute.ai/#partners.

References