1 Introduction and Related Work
Reasoning about the world in terms of objects and the relationships between them is an important aspect of human and machine cognition. It enables generalization to novel compositions of concepts (Atzmon et al., 2016; Johnson et al., 2017; Bahdanau et al., 2018; Keysers et al., 2019; Lake, 2019) and supports interpretability of machine vision and reasoning (Norcliffe-Brown et al., 2018; Hudson & Manning, 2019a). When learning to solve image understanding tasks, a model can be exposed to compositions such as “person riding a horse” and “person next to a bike”. Then, at test time, to accurately recognize the novel composition “person riding a bike”, the model needs to understand the concepts of ‘person’, ‘bike’, ‘horse’ and ‘riding’. A model might appear to recognize such novel compositions by capturing and inappropriately exploiting statistical correlations in the data, e.g. if ‘riding’ has always occurred in an outdoor landscape, or if people have always worn helmets when riding. Unsurprisingly, more careful compositional generalization tests have shown that such models can fail dramatically (Atzmon et al., 2016; Lu et al., 2016; Tang et al., 2020; Knyazev et al., 2020), e.g. recall can drop from 30% to 4.5% (Figure 1).
One of the main approaches to tackling this problem is to explicitly introduce an inductive bias of compositionality, in the form of translation operators (Zhang et al., 2017), decoupled object and predicate features (Yang et al., 2018), or causal graphs (Tang et al., 2020).
However, another possible approach, still underexplored in compositional image understanding, is exposing the model to a large diversity of training examples that leads to emergent generalization (Hill et al., 2019; Ravuri & Vinyals, 2019). To avoid the expensive labeling of additional data required to increase this diversity, we consider a generative approach, in particular, generative adversarial networks (GANs) (Goodfellow et al., 2014). Recently, GANs have been significantly improved w.r.t. stability of training and the quality of generated samples (see BigGAN (Brock et al., 2018)), but their use for data augmentation still remains limited (Ravuri & Vinyals, 2019). One way to address this limitation is to learn the factors of variation in the data (Chen et al., 2016), such that out-of-distribution (OOD) samples useful for augmentation can be created by conditioning on unseen combinations of factors. Indeed, recent work has shown that it is possible to produce plausible OOD examples conditioned on unseen label combinations by intervening on the underlying graph (Kocaoglu et al., 2017). In this work, we have direct access to the underlying graphs of images in the form of scene graphs. By perturbing these graphs in a certain way, we propose to create OOD compositions, so that a GAN conditioned on them is encouraged to generate diverse OOD samples.
The closest work that also considers a generative approach for visual relationship detection is (Wang et al., 2019). Compared to their work, where the GAN is conditioned on triplets (i.e. <subject, predicate, object>, such as <person, riding, horse>), we condition a GAN on entire scene graphs, which combinatorially increases the number of possible augmentations. Conditioning on a whole scene graph also helps to avoid creating entirely implausible scenarios: by randomly perturbing only part of the scene graph, we maintain its overall likelihood. Next, we condition on object and predicate categories rather than on visual features of training images. This avoids the issue of mismatched features (coming from different contexts), which can degrade the quality of generated features. Our model also follows a standard scene graph classification pipeline (Xu et al., 2017; Zellers et al., 2018), including both object and predicate classification, instead of classifying only the predicate, which enables a more comprehensive study of compositional generalization (Bahdanau et al., 2018). Finally, we train the generative and classification models jointly, end-to-end (Figure 3).
Generating images conditioned on a structured input, such as a scene graph or a bounding-box layout, has been explored in several recent works, summarized in Table 1. In this work, we rely on (Johnson et al., 2018) to generate visual features given a scene graph.
Table 1: How related generative models obtain the layout and relationships ($G$: scene graph, $B$: bounding boxes, $T$: text).

| Model | Layout | Relationships |
|---|---|---|
| Sg2Im (Johnson et al., 2018) | Determ. given $G$ or $B$ | Inferred from $G$ or $B$ |
| InterSg2Im (Mittal et al., 2019) | Determ. given $G$ | Spatial only |
| SOARISG (Ashual & Wolf, 2019) | Provided by user | Spatial only |
| OC-GAN (Sylvain et al., 2020) | Provided by user | Spatial only |
| Layout2Im (Zhao et al., 2019) | Provided by user | Not supported |
| LostGAN (Sun & Wu, 2019) | Provided by user | Not supported |
| SPADE (Park et al., 2019) | Masks provided by user | Not supported |
| Caption2Im (Hong et al., 2018) | Determ. given $T$ | Inferred from $T$ |
| Ours | Ground truth given | Inferred from $G$ |
Other works focus on improving the quality of generated images by avoiding generating the layout or by avoiding conditioning on a full spectrum of relationships. We show that using a layout is essential in our task, and we leverage the ground-truth layout instead of generating it, because (1) the layout generated by the model of Johnson et al. (2018) was often implausible in the case of OOD conditioning (Figure 2); and (2) in our experience, training layout generation jointly with the rest of the model was unstable. The limitations of our approach are discussed in Section 3.
[Figure 2: example compositions: cup on table, person riding elephant, cup under table, clock under table, giraffe riding elephant, giraffe wears shirt.]
2 Method

We use a conditional generative adversarial network (CGAN) (Mirza & Osindero, 2014) (Figure 3). We also considered ACGAN (Odena et al., 2017), but found its performance significantly inferior to CGAN on our task.
2.1 Scene Graph Classification Model
During training, we are given tuples $(I, G, B)$: an image $I$, the corresponding ground truth scene graph $G$ consisting of objects and the relationships between them, and bounding boxes $B$ for all objects. Following Xu et al. (2017), we use a pretrained object detector (Ren et al., 2015) to extract global visual features $H$ and, given $B$, visual features of nodes and edges, $V^{node}$ and $V^{edge}$ respectively. Features $H$ are extracted from one of the last convolutional layers, while $V^{node}$ and $V^{edge}$ are extracted from $H$ using RoIAlign (He et al., 2017), given the bounding boxes $B$ and their unions (for edges) respectively. We follow Xu et al. (2017) in that we do not update the detector during training, and we only solve the tasks that assume ground truth bounding boxes are available instead of proposals. Given visual features $V = (V^{node}, V^{edge})$, the message passing (MP) model $F$ of Xu et al. (2017) is trained to predict a scene graph $\hat{G} = F(V)$, i.e. we need to correctly assign object labels to node features $V^{node}$ and predicate classes to edge features $V^{edge}$. This completes our baseline model.
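To make this step concrete, below is a minimal PyTorch-style sketch of pooling node and edge features from a detector feature map with RoIAlign. The function names, output size and spatial scale are illustrative assumptions, not values from the released code.

```python
import torch
from torchvision.ops import roi_align

def extract_features(H, boxes, output_size=(7, 7), spatial_scale=1.0 / 16):
    """Pool node and edge features from a detector feature map.

    H:     (1, C, h, w) feature map from one of the last conv layers.
    boxes: (N, 4) ground-truth boxes in image coordinates (x1, y1, x2, y2).
    """
    # Node features: one RoIAlign crop per object box (batch index prepended).
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    V_node = roi_align(H, rois, output_size, spatial_scale)

    # Edge features: RoIAlign over the union box of every ordered object pair.
    N = len(boxes)
    i, j = torch.meshgrid(torch.arange(N), torch.arange(N), indexing='ij')
    pairs = torch.stack([i.flatten(), j.flatten()], dim=1)
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # drop self-pairs
    unions = torch.cat([
        torch.minimum(boxes[pairs[:, 0], :2], boxes[pairs[:, 1], :2]),  # top-left corner
        torch.maximum(boxes[pairs[:, 0], 2:], boxes[pairs[:, 1], 2:]),  # bottom-right corner
    ], dim=1)
    rois_u = torch.cat([torch.zeros(len(unions), 1), unions], dim=1)
    V_edge = roi_align(H, rois_u, output_size, spatial_scale)
    return V_node, V_edge  # shapes (N, C, 7, 7) and (N*(N-1), C, 7, 7)
```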
2.2 Generative Model
To augment the baseline model, we generate novel visual features $\tilde{V} = G_{\theta}(\tilde{G}, z)$ conditioned on a scene graph $\tilde{G}$ obtained by perturbing a ground truth graph $G$. Analogous to the baseline model, given the augmented features $\tilde{V}$ we train $F$ to predict $\tilde{G}$, so that such augmentation can improve the performance of $F$ at test time. Nodes and edges in $\tilde{G}$ are represented as GloVe word embeddings (Pennington et al., 2014). A noise vector $z$ is injected at each node and edge in $\tilde{G}$ to make sure different features are generated for the same $\tilde{G}$. The generator $G_{\theta}$ is implemented as a graph convolutional network (GCN), followed by layout construction and feature refinement (Johnson et al., 2018), which we explain in Figure 4. We have independent discriminators for nodes and edges, $D_{node}$ and $D_{edge}$, that discriminate real features ($V^{node}$, $V^{edge}$) from fake ones ($\tilde{V}^{node}$, $\tilde{V}^{edge}$) conditioned on their class, as per the CGAN. We also add a global discriminator $D_{global}$ acting on feature maps $H$, which encourages global consistency between nodes and edges. Thus, $D_{node}$ and $D_{edge}$ are trained to match marginal distributions, while $D_{global}$ is trained to match the joint distribution. The intuition is that the right balance between these discriminators should enable the generation of realistic visual features conditioned on OOD scene graphs.
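For illustration, here is a minimal sketch of the class-conditional discriminator interface used for $D_{node}$ and $D_{edge}$. We show simple label-embedding concatenation; the actual conditioning mechanism and layer sizes may differ.

```python
import torch
import torch.nn as nn

class CondDiscriminator(nn.Module):
    """Score a (pooled) feature vector conditioned on its object/predicate class."""

    def __init__(self, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_classes, hidden)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),  # one real/fake logit per feature vector
        )

    def forward(self, x, y):
        # x: (B, feat_dim) node or edge features; y: (B,) class indices.
        return self.net(torch.cat([x, self.embed(y)], dim=1))
```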
2.3 Scene Graph Perturbations
We experiment with two types of perturbations: random and BERT-based (Devlin et al., 2018). In both cases, for a scene graph with $N$ nodes and $M$ edges, we randomly choose $\lfloor \alpha N \rfloor$ nodes and $\lfloor \alpha M \rfloor$ edges for which we change the category, where $\alpha \in [0, 1]$ is the intensity of perturbations ($\alpha = 0$ in the case of no perturbations). For random perturbations, we sample categories of nodes and edges from a uniform distribution. For BERT, we create a textual query from a triplet, in which we mask out one of the nodes or the edge. A pretrained BERT model then returns a list of tokens plausible for the masked entity, ranked by their likelihood scores (Figure 5). We also explore contextual BERT perturbations, where each perturbation is conditioned on the current scene graph. This is achieved by simply appending all triplets other than the perturbed one to the BERT query. For BERT-based perturbations, we introduce a tuned threshold $T$ defining a lower bound on the BERT score of a given token (a higher threshold denotes a more likely token according to BERT).
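A sketch of the BERT-based perturbation using the HuggingFace fill-mask pipeline is shown below; the query template, model checkpoint and threshold value are illustrative assumptions, not the exact setup of this work.

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')

def bert_perturb(subj, pred, obj, mask='object', threshold=0.01, context=''):
    """Propose replacements for one entity of a <subj, pred, obj> triplet.

    context: the remaining triplets of the scene graph as text (contextual
    BERT); empty for the plain BERT perturbation.
    """
    m = fill_mask.tokenizer.mask_token  # '[MASK]' for BERT
    triplet = {'subject': subj, 'predicate': pred, 'object': obj}
    triplet[mask] = m
    query = f"{context} {triplet['subject']} {triplet['predicate']} {triplet['object']}".strip()
    # Each candidate is a dict with a 'token_str' and a likelihood 'score'.
    candidates = fill_mask(query, top_k=20)
    return [c['token_str'] for c in candidates if c['score'] > threshold]

# e.g. bert_perturb('person', 'riding', 'horse') may suggest ['bike', 'elephant', ...]
```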
We have also noticed that the original scene graphs can be very sparse, with many isolated nodes (Figure 8), which can hurt feature propagation in both $F$ and $G_{\theta}$. Therefore, in both the random and BERT-based cases, we further augment scene graphs by adding new edges, following the same steps described above, but for edges only. We found that adding new edges works well in practice. We leave the addition of new nodes for future work.
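The random perturbation together with this edge-densification step can be sketched as follows (a simplification under our notation; $\alpha$ is the perturbation intensity, and the function name is ours):

```python
import random

def random_perturb(nodes, edges, num_obj_cls=150, num_pred_cls=50,
                   alpha=0.2, n_new_edges=0):
    """Re-label a fraction alpha of nodes and edges uniformly at random,
    then densify the graph by adding new randomly labeled edges.

    nodes: list of object class ids; edges: dict {(i, j): predicate class id}.
    """
    nodes, edges = list(nodes), dict(edges)
    for i in random.sample(range(len(nodes)), int(alpha * len(nodes))):
        nodes[i] = random.randrange(num_obj_cls)      # uniform object category
    for e in random.sample(list(edges), int(alpha * len(edges))):
        edges[e] = random.randrange(num_pred_cls)     # uniform predicate category
    for _ in range(n_new_edges):
        # Add an edge between a random, previously unconnected pair of nodes.
        i, j = random.sample(range(len(nodes)), 2)
        edges.setdefault((i, j), random.randrange(num_pred_cls))
    return nodes, edges
```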
The scene graph classification loss is defined as:

$$\mathcal{L}_{CLS} = \mathcal{L}\big(F(V),\, G\big), \quad (1)$$

where $\mathcal{L}$ is a scene graph classification loss, typically implemented as cross-entropy over nodes and edges (Zellers et al., 2018). We use its improved version, the graph density-normalized loss (Knyazev et al., 2020). Using this loss alone corresponds to the baseline model.
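For reference, the plain (non-normalized) version of this loss is simply cross-entropy over node and edge predictions; in the sketch below, the density-based reweighting of Knyazev et al. (2020) is abstracted into a single weight.

```python
import torch.nn.functional as F

def sg_cls_loss(node_logits, node_labels, edge_logits, edge_labels, edge_weight=1.0):
    """Eq. (1) in its plain form: cross-entropy for nodes and edges.

    edge_weight: stand-in for the graph density-based normalization of
    Knyazev et al. (2020); edge_weight = 1.0 recovers the standard loss.
    """
    return (F.cross_entropy(node_logits, node_labels)
            + edge_weight * F.cross_entropy(edge_logits, edge_labels))
```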
The reconstruction loss is defined as:

$$\mathcal{L}_{REC} = \mathcal{L}\big(F(\tilde{V}),\, \tilde{G}\big), \quad \tilde{V} = G_{\theta}(\tilde{G}, z), \quad (2)$$

where $\tilde{G}$ is some perturbation of the ground truth graph $G$. The purpose of this loss is to improve the scene graph classification model $F$, which is our main goal, and at the same time to improve the generator $G_{\theta}$, which can then generate better features to further improve $F$.
We also consider conditional adversarial losses. For convenience, we first write these separately for the discriminator and the generator in a general form. For some real features $x$, fake features $\tilde{x}$ and their corresponding class $y$:

$$\mathcal{L}_{D} = -\,\mathbb{E}\big[\log D(x \mid y)\big] - \mathbb{E}\big[\log\big(1 - D(\tilde{x} \mid y)\big)\big], \quad (3)$$

$$\mathcal{L}_{G_{\theta}} = -\,\mathbb{E}\big[\log D(\tilde{x} \mid y)\big]. \quad (4)$$

We compute these losses for node and edge visual features using the discriminators $D_{node}$ and $D_{edge}$. The losses are also computed for global features $H$ using $D_{global}$, so that the total discriminator and generator losses are:

$$\mathcal{L}_{D}^{total} = \mathcal{L}_{D_{node}} + \mathcal{L}_{D_{edge}} + \mathcal{L}_{D_{global}}, \qquad \mathcal{L}_{G_{\theta}}^{total} = \mathcal{L}_{G_{node}} + \mathcal{L}_{G_{edge}} + \mathcal{L}_{G_{global}}, \quad (5)$$

where, for simplicity, the global discriminator is unconditional, i.e. it scores $H$ without a class $y$, which can be addressed in future work. Thus, the total loss that we aim to minimize is:

$$\mathcal{L}_{total} = \mathcal{L}_{CLS} + \lambda_{1}\,\mathcal{L}_{REC} + \lambda_{2}\,\mathcal{L}_{G_{\theta}}^{total}, \quad (6)$$

where $\lambda_{1}$ and $\lambda_{2}$ are loss weights set in our experiments; the discriminators are trained by minimizing $\mathcal{L}_{D}^{total}$.
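A minimal sketch of losses (3), (4) and (6), using the conditional discriminator interface sketched above and binary cross-entropy to implement the log terms; the $\lambda$ values are placeholders rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def d_loss(D, x_real, x_fake, y):
    """Eq. (3): discriminator loss for one feature type, conditioned on class y."""
    logit_real = D(x_real, y)
    logit_fake = D(x_fake.detach(), y)  # detach: no gradient into the generator
    return (F.binary_cross_entropy_with_logits(logit_real, torch.ones_like(logit_real))
            + F.binary_cross_entropy_with_logits(logit_fake, torch.zeros_like(logit_fake)))

def g_loss(D, x_fake, y):
    """Eq. (4): generator loss, i.e. fool D on generated features."""
    logit_fake = D(x_fake, y)
    return F.binary_cross_entropy_with_logits(logit_fake, torch.ones_like(logit_fake))

def total_loss(loss_cls, loss_rec, loss_g_total, lambda_rec, lambda_gan):
    """Eq. (6): objective for the classifier and generator; the discriminators
    are updated separately by minimizing eq. (5)."""
    return loss_cls + lambda_rec * loss_rec + lambda_gan * loss_g_total
```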
3 Experiments

[Table 2: recall results for SGCls and PredCls; columns: Model, Perturbation, Scene Graph Classification, Predicate Classification.]
For the baseline, we use Message Passing (MP) (Xu et al., 2017), which has strong compositional generalization capabilities (Knyazev et al., 2020). We use a publicly available implementation of MP (https://github.com/rowanz/neural-motifs) with its default hyperparameters, both for the baseline MP and for MP with a GAN. We evaluate the models on a standard split of Visual Genome (Krishna et al., 2017), with the 150 most frequent object classes and 50 predicate classes, introduced in (Xu et al., 2017).
To train the GAN, we generally follow the hyperparameters suggested by SPADE (Park et al., 2019). In particular, we use Spectral Norm (Miyato et al., 2018) for the discriminators, Batch Norm (Ioffe & Szegedy, 2015) for the generator, and TTUR (Heusel et al., 2017) with learning rates of 1e-4 and 2e-4 for the generator and the discriminators, respectively.
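Concretely, this corresponds to roughly the following setup (a sketch with toy stand-in modules; the Adam betas follow common SPADE-style defaults and are our assumption):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Toy stand-ins for the generator and the three discriminators.
generator = nn.Sequential(nn.Linear(128, 512), nn.BatchNorm1d(512), nn.ReLU())
discriminators = nn.ModuleList([
    nn.Sequential(spectral_norm(nn.Linear(512, 256)), nn.LeakyReLU(0.2),
                  spectral_norm(nn.Linear(256, 1)))
    for _ in range(3)  # D_node, D_edge, D_global
])

# TTUR (Heusel et al., 2017): a slower generator and faster discriminators.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(discriminators.parameters(), lr=2e-4, betas=(0.0, 0.999))
```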
Following prior work (Xu et al., 2017; Zellers et al., 2018), we evaluate on the scene graph classification (SGCls) and predicate classification (PredCls) tasks using recall metrics (Table 2). Adding our generative method yields noticeable gains on some metrics. Importantly, we improve recall on zero-shot triplets, which better measures compositional generalization. This suggests that increasing the diversity of training samples is beneficial for generalization.
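For reference, a simplified sketch of the recall computation is given below. Full implementations also match predicted boxes to ground truth, but under SGCls and PredCls the ground-truth boxes are given, so triplet matching is the core; the zero-shot variant keeps only ground-truth triplets unseen during training.

```python
def recall_at_k(pred_triplets, gt_triplets, k, zero_shot_set=None):
    """pred_triplets: (subj_cls, pred_cls, obj_cls) tuples sorted by confidence.
    gt_triplets: set of ground-truth triplets for one image.
    zero_shot_set: if given, restrict GT to triplets unseen during training.
    """
    if zero_shot_set is not None:
        gt_triplets = {t for t in gt_triplets if t in zero_shot_set}
    if not gt_triplets:
        return None  # this image does not contribute to the metric
    hits = gt_triplets & set(pred_triplets[:k])
    return len(hits) / len(gt_triplets)
```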
We perform a number of ablations to determine the effect of the proposed losses and perturbation strategies (Figure 6). The largest drops in performance are observed if we turn off any of the losses in (6), if we generate $\tilde{V}$ directly using a GCN without a layout, or if we attempt to generate the layout (with a bounding box loss) according to Johnson et al. (2018) instead of using the ground truth one. While generating the layout is desirable for our pipeline, since it allows the layout to match the conditioning scene graph, we found it challenging to train layout generation jointly with the rest of the model. This may also explain why our perturbations help only marginally, and only when few nodes/edges are perturbed (Figure 7).
BERT perturbations yield some improvement on the zero-shot predicate classification task. However, overall gains are smaller than expected. We hypothesize this is because scene graphs are very sparse and the vocabulary is very limited in this split of Visual Genome, making it hard to leverage the full potential of BERT (see Figure 8). This hypothesis should be verified on a more diverse dataset, such as the original Visual Genome (Krishna et al., 2017) or GQA (Hudson & Manning, 2019b).
4 Conclusion

Recognizing objects and the relationships between them in the form of scene graphs is a challenging task, due to an extremely long-tailed distribution and a strong bias towards frequent compositions. To tackle this problem, we consider a generative approach that increases the diversity of training samples by conditioning a GAN on out-of-distribution compositions. We explore random and language-based strategies for creating novel compositions and show improvements in certain cases. We also highlight the limitations of this work; addressing them should further improve the results.
[Figure: scene graph perturbation examples; columns: Image, Ground truth, Rand, BERT, BERT, BERT+Ctx, BERT+Ctx.]
Acknowledgments

BK is funded by the Mila internship, the Vector Institute and the University of Guelph. CC is funded by DREAM CDT. EB is funded by IVADO. This research was developed with funding from DARPA. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation. We are also thankful to Brendan Duke for his help with setting up the compute environment. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute: http://www.vectorinstitute.ai/#partners.
References

- Ashual & Wolf (2019) Ashual, O. and Wolf, L. Specifying object attributes and relations in interactive scene generation. In ICCV, 2019.
- Atzmon et al. (2016) Atzmon, Y., Berant, J., Kezami, V., Globerson, A., and Chechik, G. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.
- Bahdanau et al. (2018) Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. Systematic generalization: what is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.
- Brock et al. (2018) Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
- Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- He et al. (2017) He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637, 2017.
- Hill et al. (2019) Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., and Santoro, A. Environmental drivers of systematicity and generalization in a situated agent, 2019.
- Hong et al. (2018) Hong, S., Yang, D., Choi, J., and Lee, H. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994, 2018.
- Hudson & Manning (2019a) Hudson, D. and Manning, C. D. Learning by abstraction: The neural state machine. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 5903–5916. Curran Associates, Inc., 2019a.
- Hudson & Manning (2019b) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709, 2019b.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Johnson et al. (2017) Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.
- Johnson et al. (2018) Johnson, J., Gupta, A., and Fei-Fei, L. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1219–1228, 2018.
- Keysers et al. (2019) Keysers, D., Schärli, N., Scales, N., Buisman, H., Furrer, D., Kashubin, S., Momchev, N., Sinopalnikov, D., Stafiniak, L., Tihon, T., et al. Measuring compositional generalization: A comprehensive method on realistic data. arXiv preprint arXiv:1912.09713, 2019.
- Knyazev et al. (2020) Knyazev, B., de Vries, H., Cangea, C., Taylor, G. W., Courville, A., and Belilovsky, E. Graph density-aware losses for novel compositions in scene graph generation. arXiv preprint arXiv:2005.08230, 2020.
- Kocaoglu et al. (2017) Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath, S. Causalgan: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
- Krishna et al. (2017) Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
- Lake (2019) Lake, B. M. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pp. 9788–9798, 2019.
- Lu et al. (2016) Lu, C., Krishna, R., Bernstein, M., and Fei-Fei, L. Visual relationship detection with language priors. In European conference on computer vision, pp. 852–869. Springer, 2016.
- Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Mittal et al. (2019) Mittal, G., Agrawal, S., Agarwal, A., Mehta, S., and Marwah, T. Interactive image generation using scene graphs. arXiv preprint arXiv:1905.03743, 2019.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
- Norcliffe-Brown et al. (2018) Norcliffe-Brown, W., Vafeias, S., and Parisot, S. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8334–8343, 2018.
- Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. JMLR. org, 2017.
- Park et al. (2019) Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346, 2019.
- Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
- Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Ravuri & Vinyals (2019) Ravuri, S. and Vinyals, O. Seeing is not necessarily believing: Limitations of biggans for data augmentation. 2019.
- Ren et al. (2015) Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
- Sun & Wu (2019) Sun, W. and Wu, T. Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE International Conference on Computer Vision, pp. 10531–10540, 2019.
- Sylvain et al. (2020) Sylvain, T., Zhang, P., Bengio, Y., Hjelm, R. D., and Sharma, S. Object-centric image generation from layouts. arXiv preprint arXiv:2003.07449, 2020.
- Tang et al. (2020) Tang, K., Niu, Y., Huang, J., Shi, J., and Zhang, H. Unbiased scene graph generation from biased training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
- Wang et al. (2019) Wang, X., Sun, Q., Ang, M., and Chua, T.-S. Generating expensive relationship features from cheap objects. In British Machine Vision Conference (BMVC), 2019.
- Xu et al. (2017) Xu, D., Zhu, Y., Choy, C. B., and Fei-Fei, L. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419, 2017.
- Yang et al. (2018) Yang, X., Zhang, H., and Cai, J. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 36–52, 2018.
- Zellers et al. (2018) Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840, 2018.
- Zhang et al. (2017) Zhang, H., Kyaw, Z., Chang, S.-F., and Chua, T.-S. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5532–5540, 2017.
- Zhao et al. (2019) Zhao, B., Meng, L., Yin, W., and Sigal, L. Image generation from layout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8584–8593, 2019.