Scene Graph Parsing by Attention Graph

09/13/2019 ∙ by Martin Andrews, et al.

Scene graph representations, which form a graph of visual object nodes together with their attributes and relations, have proved useful across a variety of vision and language applications. Recent work in the area has used Natural Language Processing dependency tree methods to automatically build scene graphs. In this work, we present an 'Attention Graph' mechanism that can be trained end-to-end, and produces a scene graph structure that can be lifted directly from the top layer of a standard Transformer model. The scene graphs generated by our model achieve an F-score similarity of 52.21%, surpassing the best previous approaches by 2.5 percentage points.




1 Introduction

In recent years, there have been rapid advances in the capabilities of computer systems that operate at the intersection of visual images and Natural Language Processing, including semantic image retrieval Johnson et al. (2015); Vendrov et al. (2015), image captioning Mao et al. (2014); Li et al. (2015); Donahue et al. (2015); Liu et al. (2017a), visual question answering Antol et al. (2015); Zhu et al. (2016); Andreas et al. (2016), and referring expressions Hu et al. (2016); Mao et al. (2016); Liu et al. (2017b).

As argued in Johnson et al. (2015), and more recently Wang et al. (2018), encoding images as scene graphs (a type of directed graph that encodes information in terms of objects, attributes of objects, and relationships between objects) is a structured and explainable way of expressing the knowledge from both textual and image-based sources, and is able to serve as an expressive form of common representation. The value of such scene graph representations has already been proven in a range of visual tasks, including semantic image retrieval Johnson et al. (2015), and caption quality evaluation Anderson et al. (2016).

One approach to deriving scene graphs from captions / sentences is to use NLP methods for dependency parsing. These methods extend the transition-based parser work of Kiperwasser and Goldberg (2016), to embrace more complex graphs Qi and Manning (2017), or more sophisticated transition schemes Wang et al. (2018).

Recently, an alternative to the sequential state-based models underlying transition-based parsers has gained popularity in general NLP settings, with the Transformer model of Vaswani et al. (2017) leading to high performance Language Models Radford et al. (2018), and NLP models trained using other, innovative, criteria Devlin et al. (2018).

In this paper, we supplement a pre-trained Transformer model with additional layers that enable us to ‘read off’ graph node connectivity and class information directly. This allows us to benefit from recent advances in methods for training Language Models, while building a task-specific scene graph creation model. The overall structure allows our graph elements to be created ‘holistically’, since the nodes are output in a parallel fashion, rather than through stepwise transition-based parsing.

Based on a comparison with other methods on the same Visual Genome dataset Krishna et al. (2016) (which provides rich amounts of region description - region graph pairs), we demonstrate the potential of this graph-building mechanism.

Figure 1: Example from data exploration site for Krishna et al. (2016). For this region, possible graph objects would be {cat, mouth}, attributes {brown cat, black cat, white cat, open mouth}, and relationships {cat has mouth, mouth ON cat}.

2 Region descriptions and scene graphs

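Using the same notation as Wang et al. (2018), there are three types of nodes in a scene graph: object, attribute, and relation. Let $\mathcal{C}$ be the set of object classes, $\mathcal{A}$ be the set of attribute types, and $\mathcal{R}$ be the set of relation types. Given a sentence $s$, our goal in this paper is to parse $s$ into a scene graph:

$$G(s) = \langle O(s), A(s), R(s) \rangle$$

where $O(s) = \{o_1(s), \ldots, o_m(s)\}$, with $o_i(s) \in \mathcal{C}$, is the set of object instances mentioned in the sentence $s$, $A(s) \subseteq O(s) \times \mathcal{A}$ is the set of attributes associated with object instances, and $R(s) \subseteq O(s) \times \mathcal{R} \times O(s)$ is the set of relations between object instances.

To construct the graph $G(s)$, we first create object nodes for every element in $O(s)$; then for every $(o, a)$ pair in $A(s)$, we create an attribute node and add an unlabeled arc $o \rightarrow a$; finally for every $(o_1, r, o_2)$ triplet in $R(s)$, we create a relation node and add two unlabeled arcs $o_1 \rightarrow r$ and $r \rightarrow o_2$.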
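The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `build_scene_graph`, the tuple-based node encoding, and the arc directions (object to attribute, subject to relation, relation to object) are assumptions made for the example, and each object is identified by its text alone.

```python
def build_scene_graph(objects, attributes, relations):
    """Build (nodes, arcs) from object names, (object, attribute) pairs
    and (subject, relation, object) triplets."""
    nodes, arcs = [], []
    # One object node per object instance.
    for obj in objects:
        nodes.append(("object", obj))
    # One attribute node per pair, with an unlabeled arc from its object.
    for obj, attr in attributes:
        attr_node = ("attribute", obj, attr)
        nodes.append(attr_node)
        arcs.append((("object", obj), attr_node))
    # One relation node per triplet, with two unlabeled arcs.
    for subj, rel, obj in relations:
        rel_node = ("relation", subj, rel, obj)
        nodes.append(rel_node)
        arcs.append((("object", subj), rel_node))
        arcs.append((rel_node, ("object", obj)))
    return nodes, arcs


# Hypothetical input, loosely based on the cat example in Figure 1.
nodes, arcs = build_scene_graph(
    objects=["cat", "mouth"],
    attributes=[("cat", "brown"), ("mouth", "open")],
    relations=[("cat", "has", "mouth")],
)
```

Every attribute contributes one node and one arc, and every relation contributes one node and two arcs, so the graph size grows linearly with the number of tuples.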

2.1 Dataset pre-processing and sentence-graph alignment

We used the same dataset subsetting, training/test splits, preprocessing steps and graph alignment procedures as Wang et al. (2018), thanks to their release of runnable program code [1].

[1] Upon publication, we will also release our code, which includes some efficiency improvements for the preprocessing stage, as well as the models used.

2.2 Node labels and arc directions

In this work, we use six node types, which can be communicated using the CONLL file format:

  1. SUBJ: The node label for an object (either standalone, or the subject of a relationship). The node’s arc points to a (virtual) ROOT node

  2. PRED: The node label for a relationship. The node’s arc points to the relevant SUBJ

  3. OBJT: The node label for an object that is the grammatical object of a relationship, where the node’s arc points to the relevant PRED

  4. ATTR: The arc label from the head of an object node to the head of an attribute node. The node’s arc points to an object of node type SUBJ or OBJT

  5. SAME: This label is created for nodes whose label is a phrase. For example, the phrase “in front of” is a single relation node in the scene graph. The node’s arc points to the node with which this node’s text should be simply concatenated

  6. none: This word position is not associated with a node type, and so the corresponding model output is not used to create an arc
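As an illustration of how such labels and arcs might be written down, the sketch below annotates a hypothetical phrase and serialises it as simplified four-column, CoNLL-style rows. The `Token` type, the column order, the example phrase, and the label names ATTR and SAME for items 4 and 5 are assumptions made for this example; the exact file layout used by the authors is not given in the text.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    index: int  # 1-based word position; 0 is reserved for the virtual ROOT
    word: str
    label: str  # SUBJ, PRED, OBJT, ATTR, SAME or none
    head: int   # position that this node's arc points to


def to_conll_rows(tokens: List[Token]) -> List[str]:
    """Serialise tokens as tab-separated rows: index, word, head, label."""
    return ["\t".join([str(t.index), t.word, str(t.head), t.label]) for t in tokens]


# Hypothetical annotation of "a brown cat in front of a tree".
phrase = [
    Token(1, "a", "none", 0),      # no node, so the head value is unused
    Token(2, "brown", "ATTR", 3),  # attribute of "cat"
    Token(3, "cat", "SUBJ", 0),    # subject points to ROOT
    Token(4, "in", "PRED", 3),     # relationship points to its subject
    Token(5, "front", "SAME", 4),  # text is concatenated onto "in"
    Token(6, "of", "SAME", 4),
    Token(7, "a", "none", 0),
    Token(8, "tree", "OBJT", 4),   # grammatical object points to the PRED
]
rows = to_conll_rows(phrase)
```

Because every word carries exactly one label and one head, the whole graph for a phrase fits in one row per word, which is what makes a per-position model output sufficient.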

Figure 2: Model architecture illustrating Attention Graph mechanism

3 Attention Graph Model

The OpenAI Transformer Radford et al. (2018) Language Model was used as the foundation of our phrase parsing model (see Figure 2). This Transformer model consists of a Byte-Pair Encoded subword Sennrich et al. (2015) embedding layer followed by 12 layers of “decoder-only transformer with masked self-attention heads” Vaswani et al. (2017), pretrained on the standard language modelling objective on a corpus of 7000 books.

The Language Model’s final layer outputs were then fed in to a customised “Attention Graph” layer, which performed two functions: (i) classifying the node type associated with each word; and (ii) specifying the parent node arc required from that node.
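A rough sketch of these two functions, in plain Python, is given below. It is not the authors' implementation: the use of scaled dot-product attention with separate query and key projections, the extra key for a virtual ROOT position, and all names (`attention_graph_layer`, `w_query`, `w_key`, `w_class`, `root_key`) are assumptions chosen to illustrate how a single-parent choice can be read from an attention distribution.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_graph_layer(hidden, w_query, w_key, w_class, root_key):
    """hidden: one vector per word from the Transformer's final layer.
    Returns per-word node-type probabilities and per-word parent
    probabilities over [ROOT, word 1, ..., word n]."""
    # Candidate parents: the virtual ROOT first, then every word position.
    keys = [root_key] + [matvec(w_key, h) for h in hidden]
    scale = math.sqrt(len(root_key))
    type_probs, parent_probs = [], []
    for h in hidden:
        # (i) classify the node type of this word.
        type_probs.append(softmax(matvec(w_class, h)))
        # (ii) attention over candidate parents gives the arc destination.
        query = matvec(w_query, h)
        parent_probs.append(softmax([dot(query, k) / scale for k in keys]))
    return type_probs, parent_probs
```

Taking the argmax of each row of `parent_probs` yields one parent per word, so all arcs are produced in parallel rather than through a sequence of parser transitions.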

The Attention Graph mechanism is trained using the sum of two cross-entropy loss terms against the respective target node types and parent node indices, weighted by a factor chosen to approximately equalise the contributions to the total loss of the classification and Attention Graph losses. For words where the target node type is none (e.g. common conjunctions), the cross-entropy loss due to that node’s connectivity is multiplied by zero, since its parent is irrelevant.
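The combined loss can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names, the choice to apply the weighting factor (`arc_weight`) to the connectivity term, and summing rather than averaging over words are all assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(probs[target])


def attention_graph_loss(type_probs, parent_probs, target_types,
                         target_parents, none_type, arc_weight):
    """Sum of the classification loss and the weighted, masked arc loss."""
    class_loss = 0.0
    arc_loss = 0.0
    for t_probs, p_probs, t, p in zip(type_probs, parent_probs,
                                      target_types, target_parents):
        class_loss += cross_entropy(t_probs, t)
        # Words whose target type is `none` have no meaningful parent,
        # so their connectivity loss is multiplied by zero.
        mask = 0.0 if t == none_type else 1.0
        arc_loss += mask * cross_entropy(p_probs, p)
    return class_loss + arc_weight * arc_loss


# Two words; the second has target type 1, treated here as `none`.
loss = attention_graph_loss(
    type_probs=[[0.5, 0.5], [0.25, 0.75]],
    parent_probs=[[0.5, 0.5], [0.1, 0.9]],
    target_types=[0, 1],
    target_parents=[1, 0],
    none_type=1,
    arc_weight=2.0,
)
```

In the example only the first word contributes to the arc term, so the second word's poorly predicted parent (probability 0.1) has no effect on the total.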

To convert a given region description to a graph, the BPE-encoded form is presented to the embedding layer in Figure 2, and the node types and node arc destinations are read from the classification and attention outputs of the Attention Graph layer, respectively. No post-processing is performed: if the attention mechanism suggests an arc that is not allowed (e.g. an OBJT points to a word that is not a PRED), the arc is simply dropped.
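The read-off step might look like the sketch below, which takes per-word labels and head positions (for instance the argmax of each model output) and assembles objects, attributes and relations, dropping any arc that is not allowed. The function `decode_graph`, the `ALLOWED_HEADS` table, the label names ATTR and SAME, and the handling of chained SAME words are assumptions for illustration; the text only specifies that disallowed arcs are dropped.

```python
# Which head label each node type may point to (position 0 is the virtual ROOT).
ALLOWED_HEADS = {
    "SUBJ": {"ROOT"},
    "PRED": {"SUBJ"},
    "OBJT": {"PRED"},
    "ATTR": {"SUBJ", "OBJT"},
    "SAME": {"SUBJ", "PRED", "OBJT", "ATTR", "SAME"},
}
NODE_LABELS = ("SUBJ", "PRED", "OBJT", "ATTR")


def decode_graph(words, labels, heads):
    """heads are 1-based word positions, with 0 meaning ROOT."""
    n = len(words)

    def head_label(i):
        h = heads[i]
        if h == 0:
            return "ROOT"
        return labels[h - 1] if 1 <= h <= n else None

    # Keep only allowed arcs; store 0-based head indices (-1 for ROOT).
    kept = {}
    for i, label in enumerate(labels):
        if head_label(i) in ALLOWED_HEADS.get(label, set()):
            kept[i] = heads[i] - 1

    # Attach SAME words to the node whose text they extend.
    pieces = {i: [i] for i, label in enumerate(labels) if label in NODE_LABELS}
    for i, label in enumerate(labels):
        if label == "SAME" and i in kept:
            anchor, steps = kept[i], 0
            while labels[anchor] == "SAME" and anchor in kept and steps < n:
                anchor, steps = kept[anchor], steps + 1
            if anchor in pieces:
                pieces[anchor].append(i)

    def text(i):
        return " ".join(words[j] for j in sorted(pieces[i]))

    objects = [text(i) for i, l in enumerate(labels) if l in ("SUBJ", "OBJT")]
    attributes = [(text(kept[i]), text(i))
                  for i, l in enumerate(labels) if l == "ATTR" and i in kept]
    relations = [(text(kept[kept[i]]), text(kept[i]), text(i))
                 for i, l in enumerate(labels)
                 if l == "OBJT" and i in kept and kept[i] in kept]
    return objects, attributes, relations
```

For the hypothetical phrase "a brown cat in front of a tree", a correct prediction yields the relation (cat, in front of, tree); if the OBJT word instead pointed at "cat", that arc would be dropped and the tree would remain only as a standalone object.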

| Parser | F-score (reported in Wang et al. (2018)) | F-score (our tests) | F-score (limited tuples) |
| --- | --- | --- | --- |
| Attn. Graph | - | 0.5221 | 0.5750 |
| Oracle | 0.6985 | 0.6630 | 0.7256 |

Table 1: SPICE metric scores for the Oracle (using code released by Wang et al. (2018)) and our method, under the base assumptions, and also where the number of tuples is bounded above by the number of potentially useful words in the region description

| Parser | F-score |
| --- | --- |
| Stanford Schuster et al. (2015) | 0.3549 |
| SPICE Anderson et al. (2016) | 0.4469 |
| Custom Dependency Parsing Wang et al. (2018) | 0.4967 |
| Attention Graph (ours) | 0.5221 |
| Oracle (as reported in Wang et al. (2018)) | 0.6985 |
| Oracle (as used herein) | 0.6630 |

Table 2: SPICE metric scores between scene graphs parsed from region descriptions and ground truth region graphs on the intersection of Visual Genome Krishna et al. (2016) and MS COCO Lin et al. (2014) validation set.

4 Experiments

We train and evaluate our scene graph parsing model on (a subset of) the Visual Genome (Krishna et al., 2016) dataset, in which each image contains a number of regions, with each region being annotated with a region description and a (potentially empty) region scene graph. Our training set is the intersection of Visual Genome and MS COCO (Lin et al., 2014) train2014 set (34,027 images & 1,070,145 regions), with evaluations split according to the MS COCO val2014 set (17,471 images & 547,795 regions).

We also tested the performance of the ‘Oracle’ (an algorithmic alignment system between the region descriptions and the ground-truth graph tuples) - including a regime where the number of tuples was limited to the number of words, excluding {a, an, the, and}, in the region description.
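The tuple limit in this regime can be illustrated with a short sketch. The names `max_tuples` and `limit_tuples`, the whitespace tokenisation, and the choice to keep the first tuples when truncating are assumptions; the text states only that the number of tuples is bounded by the count of words outside {a, an, the, and}.

```python
IGNORED_WORDS = {"a", "an", "the", "and"}


def max_tuples(region_description):
    """Count the potentially useful words in a region description."""
    words = region_description.lower().split()
    return sum(1 for word in words if word not in IGNORED_WORDS)


def limit_tuples(tuples, region_description):
    """Keep at most as many tuples as there are useful words.
    Which tuples are kept is an assumption (here: the first ones)."""
    return tuples[:max_tuples(region_description)]


# Hypothetical region description: four useful words remain.
bound = max_tuples("a brown cat and the open mouth")
```

The intuition is that a description cannot express more graph tuples than it has content words, so ground-truth tuples beyond that bound cannot be recovered from the text.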

The vectors used in the Attention Graph layer have the same length as those in the rest of the Transformer model. We use the Adam optimizer (Kingma and Ba, 2014). Training was limited to 4 epochs (about 6 hours on a single Nvidia Titan X).

5 Results

The F-scores given in Table 2 indicate that there might be significant room for improving the Oracle (which, as the provider of training data, is an upper bound on any trained model’s performance). However, examination of the remaining errors suggests that an F-score near 100% will not be achievable because of issues with the underlying Visual Genome dataset. There are many instances where relationships are clearly stated in the region descriptions, but where there is no corresponding graph fragment. Conversely, attributes don’t appear to be region-specific, so there are many cases (as can be seen in Figure 1) where a given object (e.g. ‘cat’) has many attributes in the graph, but no corresponding text in the region description.
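The effect of these mismatches on the score can be seen in a small worked example. The sketch below computes an F-score over tuple sets using exact matching only, which is a simplification (the SPICE metric of Anderson et al. (2016) also matches synonyms); the function name and the example tuples, including the region description they imply, are hypothetical.

```python
def tuple_f_score(predicted, ground_truth):
    """F-score between two sets of graph tuples, using exact matching."""
    matched = len(predicted & ground_truth)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Ground truth lists attributes that the description never mentions.
ground_truth = {
    ("cat",), ("mouth",),
    ("cat", "brown"), ("cat", "black"), ("cat", "white"),
    ("mouth", "open"),
    ("cat", "has", "mouth"),
}
# A faithful parse of a hypothetical description "cat with open mouth".
predicted = {
    ("cat",), ("mouth",),
    ("mouth", "open"),
    ("cat", "with", "mouth"),
}
score = tuple_f_score(predicted, ground_truth)
```

Here three tuples match, giving precision 3/4 and recall 3/7, so even a parse that captures everything stated in the text reaches an F-score of only 6/11 (about 0.55).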

Referring to Table 2, our Attention Graph model achieves a higher F-score than previous work, despite the lower performance of the Oracle used to train it [2]. The authors also believe that there is potential for further gains, since there has been no hyperparameter tuning, nor have variations of the model been tested (such as adding extra fully bidirectional attention layers).

[2] This needs further investigation, since the Oracle results are a deterministic result of code made available by the authors of Wang et al. (2018).

6 Discussion

While the Visual Genome project is inspirational in its scope, we have found a number of data issues that put a limit on how much the dataset can be relied upon for the current task. Hopefully, there are unreleased data elements that would allow some of its perplexing features to be tidied up.

The recent surge in NLP benchmark performance has come through the use of large Language Models (trained on large, unsupervised datasets) to create contextualised embeddings for further use in downstream tasks. As has been observed by Ruder (2018, accessed November 1, 2018), the ability to perform transfer learning using NLP models heralds a new era for building sophisticated systems, even if labelled data is limited.

The Attention Graph mechanism, as introduced here, also illustrates how NLP thinking and visual domains can benefit from each other. Although it was not necessary in the Visual Genome setting, the Attention Graph architecture can be further extended to enable graphs with arbitrary connectivity to be created. This might be done in several distinct ways, for instance: (a) multiple arcs could leave each node, using a multi-head transformer approach; (b) instead of a SoftMax single-parent output, multiple directed connections could be made using independent ReLU weight-factors; and (c) the correspondence that the Transformer has from each word to nodes could potentially be untied, so that it becomes a Sequence-to-Graph model. Using attention as a way of deriving structure is an interesting avenue for future work.