
Visual Semantic Parsing: From Images to Abstract Meaning Representation

The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. These formalisms remain limited in the nature of entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input. Moreover, they allow us to generate meta-AMR graphs to unify information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.





1 Introduction

* Work done during an internship at Samsung AI Centre - Toronto.
† Work done while at Samsung AI Centre - Toronto.
Figure 1: An image from MSCOCO and Visual Genome dataset, along with its five human-generated captions, and: (a) an image-level meta-AMR graph capturing its overall semantics, (b) its human-generated scene graph.

The ability to understand and describe a scene is fundamental for the development of truly intelligent systems, including autonomous vehicles, robots navigating an environment, or even simpler applications such as language-based image retrieval. Much work in computer vision has focused on two key aspects of scene understanding, namely, recognizing entities, including object detection Liu et al. (2016); Ren et al. (2015); Carion et al. (2020); Liu et al. (2020a) and activity recognition Herath et al. (2017); Kong and Fu (2022); Li et al. (2018); Gao et al. (2018), as well as understanding how entities are related to each other, e.g., human–object interaction Hou et al. (2020); Zou et al. (2021) and relation detection Lu et al. (2016); Zhang et al. (2017); Zellers et al. (2018).

A natural way of representing scene entities and their relations is in graph form, so it is perhaps unsurprising that a lot of work has focused on graph-based scene representations and especially on scene graphs Johnson et al. (2015a). Scene graphs encode the salient regions in an image (mainly, objects) as nodes, and the relations among these (mostly spatial in nature) as edges, both labelled via natural language tags; see Fig. 1(b) for an example scene graph. Along the same lines, Yatskar et al. (2016) propose to represent a scene as a semantic role labelled frame, drawn from FrameNet Ruppenhofer et al. (2016) — a linguistically-motivated approach that draws on semantic role labelling literature.

Scene graphs and situation frames can capture important aspects of an image, yet they are limited in important ways. They both require expensive manual annotation in the form of images paired with their corresponding scene graphs or frames. Scene graphs in particular also suffer from being limited in the nature of entities and relations that they capture (see Section 2 for a detailed analysis). Ideally, we would like to capture event-level semantics (same as in situation recognition) but as a structured graph that captures a diverse set of relations and goes beyond low-level visual semantics.

Inspired by linguistically-motivated image understanding research, we propose to represent images using a well-known graph formalism for language understanding, namely Abstract Meaning Representations (AMRs; Banarescu et al., 2013). Similarly to (visual) semantic role labeling, AMRs represent “who did what to whom, where, when, and how?” Màrquez et al. (2008), but in a more structured way, by transforming the input into a graph representation. AMRs encode not only the main events, their participants, and their arguments (as in semantic role labelling/situation recognition), but also relations among various other participants and arguments; see Fig. 1(a). Importantly, AMR is a broadly-adopted and dynamically evolving formalism (e.g., Bonial et al., 2020; Bonn et al., 2020; Naseem et al., 2021), and AMR parsing is an active and successful area of research (e.g., Zhang et al., 2019b; Bevilacqua et al., 2021; Xia et al., 2021; Drozdov et al., 2022). Finally, given the high quality of existing AMR parsers (for language), we do not need manual AMR annotations for images, and can rely on existing image–caption datasets to create high-quality silver data for image-to-AMR parsing. In summary, we make the following contributions:


  • We introduce the novel problem of parsing images into Abstract Meaning Representations, a widely-adopted linguistically-motivated graph formalism; and propose the first image-to-AMR parser model for the task.

  • We present a detailed analysis and comparison between scene graphs and AMRs with respect to the nature of entities and relations they capture, the results of which further motivate research into the use of AMRs for better image understanding.

  • Inspired by work on multi-sentence AMR, we propose a graph-to-graph transformation algorithm that combines the meanings of several image caption descriptions into image-level meta-AMR graphs. The motivation behind generating the meta-AMRs is to build a graph that covers most of the entities, predicates, and semantic relations contained in the individual caption AMRs.

Our analyses suggest that AMRs encode aspects of an image content that are not captured by the commonly-used scene graphs. Our initial results on re-purposing a text-to-AMR parser for image-to-AMR parsing, as well as on creating image-level meta-AMRs, point to exciting future research directions for improved scene understanding.

2 Motivation: AMRs vs. Scene Graphs

Scene graphs (SGs) are a widely-adopted graph formalism for representing the semantic content of an image. Scene graphs have been shown useful for various downstream tasks, such as image captioning Yang et al. (2019); Li and Jiang (2019); Zhong et al. (2020), visual question answering Zhang et al. (2019a); Hildebrandt et al. (2020); Damodaran et al. (2021), and image retrieval Johnson et al. (2015b); Schuster et al. (2015); Wang et al. (2020); Schroeder and Tripathi (2020). However, learning to automatically generate SGs requires expensive manual annotations (object bounding boxes and their relations). SGs have also been shown to be highly biased in the entity and relation types that they capture. For example, an analysis by Zellers et al. (2018) reveals that clothing (e.g., dress) and object/body parts (e.g., eyes, wheel) make up over one-third of entity instances in the SGs corresponding to the Visual Genome images Krishna et al. (2016), and that more than % of all relation instances belong to the two categories of geometric (e.g., behind) and possessive (e.g., have).

One advantage of AMR graphs is that we can draw on supervision through captions associated with images. Nonetheless, the question remains as to what types of entities and relations are encoded by AMR graphs, and how these differ from SGs. To answer this question, we follow an approach similar to Zellers et al. (2018), and categorize entities and relations in SG and AMR graphs corresponding to a sample of  K images. We use the same categories as Zellers et al., but add a few new ones to capture relation types specific to AMRs, namely, Attribute (small), Quantifier (few), Event (soccer), and AMR specific (date-entity). Details of our categorization process are provided in Appendix A.

Figure 2 shows the distribution of instances for each Entity and Relation category, compared across SG and AMR graphs. AMRs tend to encode a more diverse set of relations, and in particular capture more of the abstract semantic relations that are missing from SGs. This is expected because our caption-generated AMRs by design capture the essential meaning of the image descriptions and, as such, encode how people perceive and describe scenes. In contrast, SGs are designed to capture the content of an image, including regions representing objects and (mainly spatial/geometric) visually-observable relations; see Fig. 1 for SG and AMR graphs corresponding to an image. In the context of Entities, and a major departure from SGs, (object/body) parts are less frequently encoded in AMRs, pointing to the well-known whole-object bias in how people perceive and describe scenes Markman (1990); Fei-Fei et al. (2007). In contrast, location is more frequent in AMRs.
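The category comparison underlying Fig. 2 amounts to tallying category frequencies over graph labels. A minimal sketch follows; the labels, graphs, and category map are toy stand-ins for the paper's actual annotation, not its real data.

```python
from collections import Counter

def category_distribution(graphs, category_of):
    """Tally the relative frequency of each category across a set of graphs.

    graphs: list of graphs, each given as a list of node/relation labels
    category_of: maps a label to its category (e.g., 'behind' -> 'geometric');
                 labels without a known category fall into 'other'
    """
    counts = Counter()
    for g in graphs:
        for label in g:
            counts[category_of.get(label, "other")] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Toy stand-ins (hypothetical labels, not the paper's real category map):
category_of = {"behind": "geometric", "have": "possessive", "small": "attribute"}
sg_relations = [["behind", "have", "behind"], ["have"]]
dist = category_distribution(sg_relations, category_of)
```

The same tally, run once over SG labels and once over AMR labels, yields the two distributions plotted in Fig. 2.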

The focus of AMRs on abstract content suggests that they have the potential to improve downstream tasks, especially when the task requires an understanding of the higher-level semantics of an image. Interestingly, a recent study showed that using AMRs as an intermediate representation for textual SG parsing helps improve the quality of the parsed SGs (Choi et al., 2022), even though AMRs and SGs encode qualitatively different information. Since AMRs tend to capture higher-level semantics, we propose to use them as the final image representation. The question remains as to how difficult it is to directly learn such representations from images. The rest of the paper focuses on answering this question.

Figure 2: Statistics on a selected set of top-frequency Entity and Relation categories, extracted from the AMR and SG graphs corresponding to around K images that appear in both Visual Genome and MSCOCO.

3 Method

3.1 Parsing Images into AMR Graphs

Figure 3: Model architecture for our two image-to-AMR models: (a) Img2Amr: a direct model that uses a single seq2seq encoder–decoder to generate linearized AMRs from input images; and (b) Img2Amr: a two-stage model containing two independent seq2seq components. The symbols in the figure denote the global and region features, the tag embeddings, and the embeddings of the predicted nodes. The input and output space of the decoders come from the AMR vocabulary.

We develop image-to-AMR parsers based on a state-of-the-art seq2seq text-to-AMR parser, Spring Bevilacqua et al. (2021), and the multimodal VL-Bart Cho et al. (2021). Both are transformer-based architectures with a bi-directional encoder and an auto-regressive decoder. Spring extends a pre-trained seq2seq model, Bart Lewis et al. (2020), by fine-tuning it on AMR parsing and generation. Next, we describe our models, input representation, and training.


We build two variants of our image-to-AMR parser, as depicted in Fig. 3(a) and (b).

  • Our first model, which we refer to as Img2Amr, modifies Spring by replacing Bart with its vision-and-language counterpart, VL-Bart Cho et al. (2021). VL-Bart extends Bart with visual understanding ability through fine-tuning on multiple vision-and-language tasks. With this modification, our model can receive visual features (plus text) as input, and generate linearized AMR graphs.

  • Our second model, inspired by text-to-graph AMR parsers (e.g., Zhang et al., 2019b; Xia et al., 2021), generates linearized AMRs in two stages by first predicting the nodes, and then the relations. Specifically, we first predict the nodes of the linearized AMR for a given image. These predicted nodes are then fed (along with the image) as input into a second seq2seq model that generates a linearized AMR (effectively adding the relations). We refer to this model as Img2Amr.

Input Representation.

To represent images, we follow VL-Bart, which takes the output of Faster R-CNN Ren et al. (2015) (i.e., region features and coordinates for the regions) and projects them onto 768-dimensional vectors via two separate fully-connected layers. Faster R-CNN region features are obtained via training for visual object and attribute classification Anderson et al. (2018) on Visual Genome. The visual input to our model is composed of position-aware embeddings for the regions, plus a global image-level feature (see Fig. 3). To get the position-aware embeddings for the regions, we add together the projected region and coordinate embeddings. To get the global image feature, we use the output of the final hidden layer in ResNet-101 He et al. (2016), which is passed through the same fully connected layer as the regions to obtain a 768-dimensional vector.
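The position-aware region embedding described above is simply the sum of two learned linear projections. A toy-dimensional sketch (plain lists in place of tensors; in the actual model the region features and box coordinates would be projected to 768 dimensions, and the weights would be learned):

```python
def linear(x, W, b):
    # y = W @ x + b, implemented with plain lists for illustration
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def position_aware_embedding(region_feat, region_coords, W_feat, b_feat, W_pos, b_pos):
    """Project the region feature and its box coordinates through two
    separate linear layers, then sum the projections (VL-Bart-style input)."""
    proj_feat = linear(region_feat, W_feat, b_feat)
    proj_pos = linear(region_coords, W_pos, b_pos)
    return [f + p for f, p in zip(proj_feat, proj_pos)]

# Toy example: 2-d region feature and 4-d box, both projected to 2-d
W_feat, b_feat = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_pos, b_pos = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0]
emb = position_aware_embedding([3.0, 4.0], [1.0, 2.0, 0.5, 0.5],
                               W_feat, b_feat, W_pos, b_pos)
```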


To benefit from transfer learning, we initialize the encoder and decoder weights of both our models from the pre-trained VL-Bart. This is a reasonable initialization strategy, given that VL-Bart has been pre-trained on input similar to ours. Moreover, a large number of AMR labels are drawn from the English vocabulary, and thus the pre-training of VL-Bart should also be appropriate for AMR generation. We fine-tune our models on the task of image-to-AMR generation, using images paired with their automatically-generated AMR graphs. We consider two alternative AMR representations: (a) caption AMRs, created directly from captions associated with images (see Section 4 for details); and (b) image-level meta-AMRs, constructed through an algorithm we describe below in Section 3.2. We perform experiments with either caption or meta-AMRs, where we train and test on the same type of AMRs. For the various stages of training, we use the cross-entropy loss between the model predictions and the ground-truth labels for each token, where the model predictions are obtained greedily, i.e., by choosing the token with the maximum score at each step of the sequence generation.
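The training objective above is standard per-token cross-entropy, with greedy (argmax) token selection at generation time. A minimal stdlib sketch of both pieces (not the actual training code, which operates on model logits):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_cross_entropy(logits_per_step, target_ids):
    """Average cross-entropy between per-step logits and ground-truth token ids."""
    losses = []
    for logits, t in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[t]))
    return sum(losses) / len(losses)

def greedy_pick(logits):
    """Greedy decoding step: the token with the maximum score."""
    return max(range(len(logits)), key=lambda i: logits[i])
```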

3.2 Learning per-Image meta-AMR Graphs

1: Input: human-generated image descriptions for a given image; a set of pre-defined AMR relation types R
2: Output: a meta-AMR graph G
3: Initialize: generate AMR graphs for the descriptions using a pre-trained AMR semantic parser; initialize G to be the null graph
4: for each caption AMR g do
5:     E ← getEdges(g)
6:     for each (u, r, v) in E do                ▷ (u, v) is a pair of nodes connected via an edge labeled r
7:         if (u, v) ∉ keys(G) and (v, u) ∉ keys(G) and r ∈ R then
8:             G.add(u, r, v)                    ▷ add a new edge when neither (u, v) nor (v, u) was previously included, and r belongs to the pre-selected set of AMR relation types
9: C ← weaklyConnectedComponents(G)             ▷ get all connected components as candidates, since an AMR must be a connected graph by definition
10: G ← getLargestComponent(C)                  ▷ the candidate with the largest number of nodes covers the most entities and predicates in the image
11: G ← refineNodes(G)                          ▷ replace node types by their frequent hypernym if available
Algorithm 1: Meta-AMR Graph Construction

Recall that, in order to collect a data set of images paired with their AMR graphs, we rely on image–caption datasets such as MSCOCO. Specifically, we use a pre-trained AMR parser to generate AMR graphs from each caption of an image. Images can be described in many different ways, e.g., each image in MSCOCO comes with five different human-generated captions. We hypothesize that these captions collectively represent the content of the image they are describing, and as such propose to also combine the caption AMRs into image-level meta-AMR graphs through a merge and refine process that we explain next.

Prior work has used graph-to-graph transformations for merging sentence-level AMRs into document-level AMRs for abstractive and multi-document summarization (e.g., Liu et al., 2015; Liao et al., 2018; Naseem et al., 2021). Unlike in a summarization task, captions do not form a coherent document, but instead collectively describe an image. Inspired by prior work, we propose a graph-to-graph transformation algorithm that builds a unified meta-AMR graph from caption graphs; see Algorithm 1. Specifically, we first merge the nodes and edges from the original set of caption-level AMRs, only including a pre-defined set of relation/edge labels. We then select the largest connected component of this merged graph, which we further refine by replacing non-predicate nodes with their more frequent hypernyms, when available. The motivation behind this refinement process is to reduce the complexity of the meta-AMR graphs (in terms of their size), which would potentially improve parsing performance. An example of a meta-AMR graph generated from caption AMRs is given in Appendix C.
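The merge-and-refine procedure of Algorithm 1 can be sketched in plain Python over edge-triple lists. The relation whitelist, hypernym map, and function names below are illustrative stand-ins (the paper uses its own relation set and a WordNet lookup), and this sketch skips the predicate/non-predicate distinction in the refine step:

```python
from collections import defaultdict

# Hypothetical stand-ins for the pre-selected relation set and WordNet lookup
KEPT_RELATIONS = {"arg0", "arg1", "mod", "location"}
HYPERNYM = {"salmon": "fish"}

def merge_caption_amrs(caption_amrs):
    """Merge caption-level AMR edge lists into one graph, keeping (u, rel, v)
    only if rel is whitelisted and no edge between u and v (in either
    direction) was added before (cf. lines 4-8 of Algorithm 1)."""
    edges = {}
    for amr in caption_amrs:
        for u, rel, v in amr:
            if rel in KEPT_RELATIONS and (u, v) not in edges and (v, u) not in edges:
                edges[(u, v)] = rel
    return edges

def largest_component(edges):
    """Keep only the largest weakly connected component, since an AMR
    must be a connected graph (lines 9-10)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in list(adj):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # iterative DFS over undirected edges
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {(u, v): r for (u, v), r in edges.items() if u in best and v in best}

def refine_nodes(edges):
    """Replace nodes by a frequent hypernym when one is available (line 11)."""
    rn = lambda n: HYPERNYM.get(n, n)
    return {(rn(u), rn(v)): r for (u, v), r in edges.items()}

# Two toy caption AMRs as (head, relation, tail) triples
amrs = [[("eat-01", "arg0", "woman"), ("eat-01", "arg1", "salmon")],
        [("woman", "mod", "young"), ("plate", "location", "table")]]
meta = refine_nodes(largest_component(merge_caption_amrs(amrs)))
```

The disconnected `plate`/`table` fragment is dropped by the component step, and `salmon` is generalized to `fish` by the refine step.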

AMR graphs of the MSCOCO training captions contain more than types of semantic relations and more than K node types, with long-tailed distributions; see Fig. 6 in Appendix B. To refine meta-AMR graphs, we only maintain the top- most frequent relation types that include core roles, such as arg0, arg1, etc., as well as high-frequency non-core roles, such as mod and location. To further refine the graphs, we replace each non-predicate node (e.g., salmon) with its most frequent hypernym (e.g., fish) according to WordNet Fellbaum (1998). This results in just about reduction in the number of node types (to K). The average complexity of graphs is also reduced from nodes and relations to and , respectively.

4 Experimental Setup


For our task of AMR generation from images, we use an augmented version of the standard MSCOCO image–caption dataset, which is composed of images paired with their captions, automatically generated caption-level linearized AMR graphs, and an image-level linearized meta-AMR graph. We use the splits established in previous work Karpathy and Fei-Fei (2015), containing training, validation, and test images, where each image is associated with five manually-annotated captions. Following the cross-modal retrieval work involving MSCOCO (e.g., Lee et al., 2018), we use a subset of the val and test sets, containing images each. AMR graphs of the captions are obtained by running the Spring text-to-AMR parser Bevilacqua et al. (2021), trained on the AMR2.0 dataset. The meta-AMR graph is created from the individual AMRs through our merge-and-refine process described in Algorithm 1 of Section 3.

Parser implementation details.

We initialize our Img2Amr models from VL-Bart, which is based on BartBase. Bart uses a sub-word tokenizer with a vocabulary size of . Following Spring, we expand the vocabulary to include frequent AMR-specific tokens and symbols (e.g., :op, arg1, temporal-entity), resulting in a vocabulary size of . The addition of AMR-specific symbols to the vocabulary improves efficiency by avoiding extensive sub-token splitting. The embeddings of these additional tokens are initialized by taking the average of the embeddings of their sub-word constituents. The Img2Amr models are trained for epochs, while the Img2Amr models are trained for epochs per stage. We use a batch size of with gradients being accumulated for batches (hence an effective batch size of ); the batch size was limited due to the length of the linearized meta-AMRs. The optimizer used is RAdam Liu et al. (2020b), with a learning rate of and a dropout rate of . Each experiment is run on one Nvidia V100-32G GPU. Model selection is based on the best SemBleu-1.
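The embedding initialization for the added AMR tokens can be sketched as follows; the tokenizer interface and embedding table are hypothetical stand-ins for the actual Bart tokenizer and embedding matrix:

```python
def init_new_token_embedding(token, tokenize, embedding_table):
    """Initialize an added token's embedding as the mean of the embeddings
    of the sub-word pieces the original tokenizer splits it into."""
    piece_ids = tokenize(token)                   # e.g., ':op' -> ids of ':', 'op'
    pieces = [embedding_table[i] for i in piece_ids]
    dim = len(pieces[0])
    # Average the constituent embeddings component-wise
    return [sum(vec[d] for vec in pieces) / len(pieces) for d in range(dim)]

# Toy example: a token that splits into pieces 0 and 1
tokenize = lambda t: [0, 1]
table = {0: [1.0, 3.0], 1: [3.0, 5.0]}
emb = init_new_token_embedding(":op", tokenize, table)
```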

5 Results

Model Train/Test AMRs Smatch SemBleu-1 SemBleu-2
Img2Amr meta-AMRs 37.7 ± 0.2 32.6 ± 0.8 15.2 ± 0.5
Img2Amr meta-AMRs 38.6 ± 0.3 30.9 ± 0.4 15.6 ± 0.3
Img2Amr caption AMRs 52.3 ± 0.4 68.6 ± 0.4 38.4 ± 0.8
Table 1: Test results, averaged over runs, for our Img2Amr models that follow the best setting, when trained and tested on either meta-AMRs or caption AMRs.
Model Bleu-4 Cider Meteor Spice
Img2Amr + Amr2Txt 31.7 111.7 26.8 20.4
VL-Bart 35.1 116.6 28.7 21.5
Table 2: Image captioning results on test set, compared with the best reported captioning results for VL-Bart.

5.1 Image-to-AMR Parsing Performance

We use the standard measures of Smatch Cai and Knight (2013) and SemBleu Song and Gildea (2019) to evaluate our various image-to-AMR models. Smatch compares two AMR graphs by calculating the F1-score between the nodes and edges of these two graphs. This score is calculated after applying a one-to-one mapping of the two AMRs based on their nodes. This mapping is chosen so that it maximizes the F1-score between the two graphs. However, since finding the best exact mapping is NP-complete, a greedy hill-climbing algorithm with multiple random initializations is used to obtain this best mapping. SemBleu extends the Bleu Papineni et al. (2002) metric to AMR graphs, where each AMR node is considered a unigram (used in SemBleu-1), and each pair of connected nodes along with their connecting edge is considered a bigram (used in SemBleu-2). These metrics are calculated between the model predictions and the noisy AMR ground-truth.
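The SemBleu components described above reduce to n-gram precision over graph elements: nodes as unigrams, and (head, relation, tail) triples as bigrams. A simplified stdlib sketch (clipped precision only, without SemBleu's brevity penalty or the geometric mean over orders):

```python
from collections import Counter

def graph_ngrams(nodes, edges, n):
    """SemBleu-style n-grams: unigrams are nodes; bigrams are the
    (head, relation, tail) triples, one per edge."""
    if n == 1:
        return Counter(nodes)
    return Counter((u, rel, v) for u, rel, v in edges)

def clipped_precision(pred, gold, n):
    """Clipped n-gram precision between a predicted and a gold graph,
    each given as a (nodes, edges) pair."""
    p = graph_ngrams(*pred, n)
    g = graph_ngrams(*gold, n)
    overlap = sum(min(c, g[k]) for k, c in p.items())
    total = sum(p.values())
    return overlap / total if total else 0.0

# Toy graphs: prediction got the 'dog' node but the wrong predicate
pred = (["dog", "walk-01"], [("walk-01", "arg0", "dog")])
gold = (["dog", "run-01", "park"],
        [("run-01", "arg0", "dog"), ("run-01", "location", "park")])
```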

We report results on generating caption AMRs (when the models are trained and tested on these AMRs), as well as meta-AMRs. When evaluating caption AMR generation, we compare the model output to the five reference AMRs, and report the maximum of these five scores. The intuition is to compare the predicted AMR to the most similar AMR among the five references. Table 1 (top two rows) shows the performance of the models on the task of generating meta-AMRs from test images. We perform ablations of the model input combinations on the val set (see Section D below), and report test results for the best setting, which uses all the input features for both models. The model does slightly better on this task when looking at the Smatch and SemBleu-2 metrics, which take the structure of AMRs into account. Note that SemBleu-1 only compares the nodes of the predicted and ground-truth graphs.

Meta-AMR graphs tend, on average, to be longer than individual caption AMRs ( vs nodes and relations). We thus expect the generation of meta-AMRs to be harder than that of caption AMRs. Moreover, although we hypothesize that meta-AMRs capture a holistic meaning for an image, the caption AMRs still capture some (possibly salient) aspect of an image's content, and as such are useful to predict, especially if they can be generated with higher accuracy. We thus report the performance of our model on generating caption AMRs (when trained on caption AMR graphs); see the final row of Table 1. We can see that, as expected, performance is much higher on generating caption AMRs vs. meta-AMRs.

Given that AMRs and natural language are by design closer in the semantic space than AMRs and images, it is not unexpected that the results for our image-to-AMR task are not comparable with those of SoTA text-to-AMR parsers, including Spring. Our results highlight challenges similar to those of general image-to-graph parsing techniques, including visual scene graph generation Zhu et al. (2022), where there still exists a large gap in predictive model performance.

5.2 Image-to-AMR for Caption Generation

To better understand the quality of our generated AMRs, we use them to automatically generate sentences from caption AMRs (using an existing AMR-to-text model), and evaluate the quality of these generated sentences against the reference captions of their corresponding images. Specifically, we use the Spring AMR-to-text model that we train from scratch on a dataset composed of AMR2.0, plus the training MSCOCO captions paired with their (automatically-generated) AMRs. We evaluate the quality of our AMR-generated captions using standard metrics commonly used in the image captioning community, i.e., Cider Vedantam et al. (2015), Meteor Denkowski and Lavie (2014), Bleu-4 Papineni et al. (2002), and Spice Anderson et al. (2016), and compare against VL-Bart’s best captioning performance as reported in the original paper Cho et al. (2021). Reported in Table 2, the results clearly show that the quality of the generated AMRs is such that reasonably good captions can be generated from them, suggesting that AMRs can be used as intermediate representations for such downstream tasks. Future work will need to explore the possibility of further adapting the AMR formalism to the visual domain, as well as the possibility of enriching image AMRs via incorporating additional linguistic or commonsense knowledge, which could potentially result in better quality captions.

5.3 Performance per Concept Category

The analysis presented in Section 2 suggests that many concepts in AMR graphs tend to be on the more abstract (less perceptual) side. We thus ask the following question: which categories are harder to predict? To answer this question, we look into the node prediction performance of our two-stage model for the different entity and relation categories of Section 2. Note that this categorization is available for a subset of nodes only. To get the per-category recall and precision values, we take the node predictions of the first stage of the Img2Amr model (trained to predict meta-AMR nodes) on the val set. For each val image, we have a set of predicted nodes, which we compare to the set of nodes in the ground-truth meta-AMR associated with the image. When calculating per-category recall/precision values, we only consider nodes that belong to that category. We calculate per-image true positive, false positive, and false negative counts, which are used to obtain the recall and precision using micro-averaging. Fig. 4 presents the per-category (as well as overall) recall and precision values over the val set.
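The micro-averaged per-category computation can be sketched as follows; the node labels and category map below are toy stand-ins, not the paper's annotation:

```python
def per_category_pr(pairs, category_of, category):
    """Micro-averaged precision/recall for one node category.

    pairs: list of (predicted_nodes, gold_nodes) set pairs, one per image
    category_of: maps a node label to its category; nodes outside the
    target category are ignored, as in the per-category analysis
    """
    tp = fp = fn = 0
    for pred, gold in pairs:
        p = {n for n in pred if category_of.get(n) == category}
        g = {n for n in gold if category_of.get(n) == category}
        tp += len(p & g)   # accumulate counts across images (micro-averaging)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy stand-ins: two images, evaluating the 'event' category
category_of = {"tennis": "event", "festival": "event", "dog": "entity"}
pairs = [({"tennis", "dog"}, {"tennis", "festival"}),
         ({"festival"}, {"festival"})]
precision, recall = per_category_pr(pairs, category_of, "event")
```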

Interestingly, events (e.g., festival, baseball, tennis) have the highest precision and recall. These are abstract concepts that are largely absent from SGs, suggesting that relying on a linguistically-motivated formalism is beneficial in capturing such abstract aspects of an image's content. The event category contains different types, many referring to sports that have a very distinctive setup, e.g., people wearing specific clothes, holding specific objects, etc. The possibility of encoding such abstract concepts in the training AMRs (generated from human-written descriptions likely to mention the event) helps the model learn to generate them for the relevant images during inference. The next groups with high precision and recall are entities (which are likely to be more closely tied to the image regions) and possessives (containing a small number of high-frequency relations, e.g., have and wear). Semantic relations show decent performance, but contain a diverse set of types, and need to be further analyzed to disentangle the effect of category vs. frequency.

Quantifiers (many of which are related to counting), geometric relations, and attributes seem to be particularly hard to predict. Counting is known to be hard for deep learning models. Geometric relations are much less frequent in AMRs compared to SGs; perhaps we do need to rely on special features (e.g., the relative position of bounding boxes) to improve performance on these relations. Attributes (such as young, old, small) require the model to learn subtle visual cues. In addition to understanding what input features may help improve performance on these categories, we need to further adapt the AMR formalism to the visual domain.

Figure 4: Node prediction performance on val, for the two-stage model, broken down by category.
(a) A couple of giraffe standing next to each other in a field near rocks walking in grass in a grassy area.
(b) A yellow and blue fire hydrant on a city street in front at an intersection sitting on the side of the road near a traffic position.
(c) A large long passenger train going across a wooden beach plate, traveling and passing by water.
(d) A woman sitting at a table eating a sandwich and holding a hot dog in a building smiling while eating.
(e) A white area filled with lots of different kinds of donuts with various toppings sitting on them.
(f) A group of people sitting around at a dining table with water posing for a picture.
(g) A person in a red jacket cross country skiing down a snow covered ski slope with a couple of people riding skis and walking on the side of the snowy mountain.
(h) A person in black shirt sitting at a table in a building with a plate of food with and smiling while having meal.
Figure 5: A sample of images, along with descriptive captions automatically generated from the meta-AMRs predicted by our Img2Amr model. Refer to Section E for the generated meta-AMRs. The url and license information for each of these images is available in Section E. Faces were blurred for privacy.

5.4 Qualitative Samples: Generating Descriptive Captions from meta-AMRs

In Section 5.2, we showed that caption AMRs produced by our Img2Amr model can be used to generate reasonably good quality captions via an AMR-to-text model. Here, we provide samples of how meta-AMRs can be used as rich intermediate representations for generating descriptive captions; see Fig. 5 and Section E. To get these captions, we apply the same AMR-to-text model that we trained as described in Section 5.2 to the meta-AMRs predicted by our Img2Amr model. Captions generated from meta-AMRs tend to be longer than the original human-generated captions, and contain many more details about the scene. These captions, however, sometimes contain repetitions of the same underlying concept/relation (though using different wordings), e.g., caption (a) contains both in grass and in a grassy area. We also see that our hypernym replacement sometimes results in using a more general term in place of a more specific but more appropriate term, e.g., woman instead of girl in (d). Nonetheless, these results generally point to the usefulness of AMRs and especially meta-AMRs for scene representation and caption generation.

6 Discussion and Outlook

In this paper, we proposed to use a well-known linguistic semantic formalism, the Abstract Meaning Representation (AMR), for scene understanding. We showed through extensive analysis the advantages of AMRs over the commonly-used visual scene graphs, and proposed to re-purpose existing text-to-AMR parsers for image-to-AMR parsing. Additionally, we proposed a graph transformation algorithm that merges several caption-level AMR graphs into a more descriptive meta-AMR graph. Our quantitative (intrinsic and extrinsic) and qualitative evaluations demonstrate the usefulness of (meta-)AMRs as a scene representation formalism.

Our findings point to a few exciting future research directions. Our image-to-AMR parsers can be improved by incorporating richer visual features, a better understanding of the entity and relation categories that are particularly hard to predict for our current models, as well as drawing on methods used for scene graph generation (e.g., Zellers et al., 2018; Zhu et al., 2022). Our meta-AMR generation algorithm can be further tuned to capture visually-salient information (e.g., quantifiers are too hard to learn from images, and perhaps can be dropped from a visual AMR formalism).

Our qualitative samples of captions generated from meta-AMRs show their potential for generating descriptive and/or controlled captions. Controllable image captioning has received a great deal of attention lately (e.g., Cornia et al., 2019; Chen et al., 2020, 2021). It focuses on the use of subjective control, including personalization and style-focused caption generation, as well as objective control over content (controlling what the caption is about, e.g., focusing on a set of regions) or over the structure of the output sentence (e.g., controlling sentence length). We believe that by using AMRs as intermediate scene representations, we can bring together the work on these various types of control, as well as draw on the literature on controllable natural language generation Zhang et al. (2022) for advancing research on rich caption generation.


  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European Conference on Computer Vision, Cited by: §5.2.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider (2013) Abstract meaning representation for sembanking. In 7th Linguistic Annotation Workshop and Interoperability with Discourse, Cited by: §1.
  • M. Bevilacqua, R. Blloshmi, and R. Navigli (2021) One SPRING to rule them both: symmetric AMR semantic parsing and generation without a complex pipeline. In Association for the Advancement of Artificial Intelligence, Cited by: §1, §3.1, §4.
  • C. Bonial, L. Donatelli, M. Abrams, S. M. Lukin, S. Tratz, M. Marge, R. Artstein, D. Traum, and C. Voss (2020) Dialogue-AMR: Abstract Meaning Representation for dialogue. In Proceedings of the 12th Language Resources and Evaluation Conference, Cited by: §1.
  • J. Bonn, M. Palmer, Z. Cai, and K. Wright-Bettner (2020) Spatial AMR: expanded spatial annotation in the context of a grounded Minecraft corpus. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Cited by: §1.
  • S. Cai and K. Knight (2013) Smatch: an evaluation metric for semantic feature structures. In 51st Annual Meeting of the Association for Computational Linguistics, Cited by: §5.1.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, Cited by: §1.
  • L. Chen, Z. Jiang, J. Xiao, and W. Liu (2021) Human-like controllable image captioning with verb-specific semantic roles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §6.
  • S. Chen, Q. Jin, P. Wang, and Q. Wu (2020) Say as you wish: fine-grained control of image caption generation with abstract scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §6.
  • J. Cho, J. Lei, H. Tan, and M. Bansal (2021) Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning, Cited by: 1st item, §3.1, §5.2.
  • W. S. Choi, Y. Heo, D. Punithan, and B. Zhang (2022) Scene graph parsing via Abstract Meaning Representation in pre-trained language models. In Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022), Cited by: §2.
  • M. Cornia, L. Baraldi, and R. Cucchiara (2019) Show, control, and tell: a framework for generating controllable and grounded captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §6.
  • V. Damodaran, S. Chakravarthy, A. Kumar, A. Umapathy, T. Mitamura, Y. Nakashima, N. Garcia, and C. Chu (2021) Understanding the role of scene graphs in visual question answering. arXiv preprint arXiv:2101.05479. Cited by: §2.
  • M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Workshop on Statistical Machine Translation, Cited by: §5.2.
  • A. Drozdov, J. Zhou, R. Florian, A. McCallum, T. Naseem, Y. Kim, and R. F. Astudillo (2022) Inducing and using alignments for transition-based AMR parsing. In North American Chapter of the Association for Computational Linguistics, Cited by: §1.
  • L. Fei-Fei, A. Iyer, C. Koch, and P. Perona (2007) What do we perceive in a glance of a real-world scene? Journal of Vision 7 (1). Cited by: §2.
  • C. Fellbaum (Ed.) (1998) WordNet: an electronic lexical database. Cambridge, MA: MIT Press. Cited by: §3.2.
  • R. Gao, B. Xiong, and K. Grauman (2018) Im2flow: motion hallucination from static images for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • S. Herath, M. Harandi, and F. Porikli (2017) Going deeper into action recognition: a survey. Image and vision computing 60. Cited by: §1.
  • M. Hildebrandt, H. Li, R. Koner, V. Tresp, and S. Günnemann (2020) Scene graph reasoning for visual question answering. arXiv preprint arXiv:2007.01072. Cited by: §2.
  • Z. Hou, X. Peng, Y. Qiao, and D. Tao (2020) Visual compositional learning for human-object interaction detection. In European Conference on Computer Vision, Cited by: §1.
  • J. Johnson, R. Krishna, M. Stark, L. J. Li, D. A. Shamma, and F. F. Li (2015a) Image retrieval using scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015b) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  • Y. Kong and Y. Fu (2022) Human action recognition and prediction: a survey. International Journal of Computer Vision 130 (5). Cited by: §1.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. Cited by: §2.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In European Conference on Computer Vision, Cited by: §4.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Association for Computational Linguistics, Cited by: §3.1.
  • X. Li and S. Jiang (2019) Know more say less: image captioning based on scene graphs. IEEE Transactions on Multimedia 21 (8), pp. 2117–2130. Cited by: §2.
  • Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2018) Flow-grounded spatial-temporal video prediction from still images. In Proceedings of the European Conference on Computer Vision, Cited by: §1.
  • K. Liao, L. Lebanoff, and F. Liu (2018) Abstract Meaning Representation for multi-document summarization. In the 27th International Conference on Computational Linguistics, Cited by: §3.2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, Cited by: Appendix A.
  • F. Liu, J. Flanigan, S. Thomson, N. Sadeh, and N. A. Smith (2015) Toward abstractive summarization using semantic representations. In North American Chapter of the Association for Computational Linguistics, Cited by: §3.2.
  • L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020a) Deep learning for generic object detection: a survey. International Journal of Computer Vision 128 (2). Cited by: §1.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020b) On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, Cited by: §4.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, Cited by: §1.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, Cited by: §1.
  • E. M. Markman (1990) Constraints children place on word meanings. Cognitive Science 14. Cited by: §2.
  • L. Màrquez, X. Carreras, K. C. Litkowski, and S. Stevenson (2008) Special issue introduction: semantic role labeling: an introduction to the special issue. Computational Linguistics 34 (2). Cited by: §1.
  • T. Naseem, A. Blodgett, S. Kumaravel, T. O’Gorman, Y. Lee, J. Flanigan, R. F. Astudillo, R. Florian, S. Roukos, and N. Schneider (2021) DocAMR: multi-sentence AMR representation and evaluation. arXiv preprint. Cited by: §1, §3.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics, Cited by: §5.1, §5.2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28. Cited by: §1, §3.1.
  • J. Ruppenhofer, M. Ellsworth, M. R. L. Petruck, C. R. Johnson, C. F. Baker, and J. Scheffczyk (2016) FrameNet II: extended theory and practice. Cited by: §1.
  • B. Schroeder and S. Tripathi (2020) Structured query-based image retrieval using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 178–179. Cited by: §2.
  • S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: §2.
  • L. Song and D. Gildea (2019) SemBleu: a robust metric for AMR parsing evaluation. In 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §5.1.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.2.
  • S. Wang, R. Wang, Z. Yao, S. Shan, and X. Chen (2020) Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1508–1517. Cited by: §2.
  • Q. Xia, Z. Li, R. Wang, and M. Zhang (2021) Stacked AMR parsing with silver data. In Findings of the Association for Computational Linguistics: EMNLP, Cited by: §1, 2nd item.
  • X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §2.
  • M. Yatskar, L. Zettlemoyer, and A. Farhadi (2016) Situation recognition: visual semantic role labeling for image understanding. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • R. Zellers, M. Yatskar, S. Thomson, and Y. Choi (2018) Neural motifs: scene graph parsing with global context. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Appendix A, §1, §2, §2, §6.
  • C. Zhang, W. Chao, and D. Xuan (2019a) An empirical study on leveraging scene graphs for visual question answering. arXiv preprint arXiv:1907.12133. Cited by: §2.
  • H. Zhang, H. Song, S. Li, M. Zhou, and D. Song (2022) A survey of controllable text generation using transformer-based pre-trained language models. arXiv preprint. Cited by: §6.
  • H. Zhang, Z. Kyaw, S. Chang, and T. Chua (2017) Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • S. Zhang, X. Ma, K. Duh, and B. Van Durme (2019b) AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §1, 2nd item.
  • Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li (2020) Comprehensive image captioning via scene graph decomposition. In European Conference on Computer Vision, pp. 211–229. Cited by: §2.
  • G. Zhu, L. Zhang, Y. Jiang, Y. Dang, H. Hou, P. Shen, M. Feng, X. Zhao, Q. Miao, S. A. A. Shah, et al. (2022) Scene graph generation: a comprehensive survey. arXiv preprint arXiv:2201.00443. Cited by: §5.1, §6.
  • C. Zou, B. Wang, Y. Hu, J. Liu, Q. Wu, Y. Zhao, B. Li, C. Zhang, C. Zhang, Y. Wei, et al. (2021) End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Cited by: §1.

Appendix A AMR vs. SG: Entity and Relation Categorization Details

The analysis provided in Section 2 requires us to annotate the entities and relations of a sample of AMRs and SGs with a pre-defined set of categories. We first select all images that appear in both MSCOCO Lin et al. (2014) and Visual Genome, so that we have access to ground-truth scene graphs as well as captions from which we can generate AMR graphs for the same set of images. We use a single AMR per image, generated from the longest caption, but include all SGs associated with an image in our analysis. For each SG and AMR graph, we consider the entities and relations corresponding to the most frequent types (around M entity and M relation instances for SGs, and around K entity and K relation instances for AMRs). We annotate these with a pre-defined set of entity and relation categories, including those defined by Zellers et al. (2018) plus a few we add to cover new AMR relations. Table 5 provides a breakdown of the categories, as well as examples of word types we consider to belong to each category. The table also provides the total number of word types per category and the percentage of instances across all types for each category.

Next, we describe our annotation process. SG nodes (entities) come with their most common WordNet sense annotations, which we use to identify their categories. For SG relations, we manually annotate the categories. To annotate AMR entities and relations, we follow a similar procedure: we automatically find the most common WordNet sense for non-predicate AMR nodes (assuming most of these are entities) and correct it if needed. For example, the automatically-identified most common sense of mouse is the Animal sense, whereas in our captions, almost all instances of the word refer to a computer mouse (Artifact). For any remaining concepts, including predicate nodes (e.g., eat, stand) and entities for which a category cannot be assigned automatically, we manually identify the categories.
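The lookup-then-override procedure above can be sketched as follows. This is a minimal stand-in: the sense table is a tiny hand-written substitute for WordNet's most-common-sense lookup (the actual annotation used WordNet senses and manual review), and the category names and override entries are illustrative assumptions.

```python
# Minimal sketch of the category-assignment procedure described above.
# MOST_COMMON_SENSE stands in for a WordNet most-common-sense lookup;
# the entries and category labels here are illustrative, not the
# paper's actual annotation data.
MOST_COMMON_SENSE = {
    "mouse": "Animal",     # WordNet's top sense is the rodent
    "clock": "Artifact",
    "beach": "Location",
}

# Manual corrections for words whose most common WordNet sense does not
# match their dominant use in image captions (e.g., the computer mouse).
CAPTION_OVERRIDES = {"mouse": "Artifact"}

def category(word):
    """Return the caption-corrected category, falling back to the
    most common sense, then to Unknown."""
    return CAPTION_OVERRIDES.get(word, MOST_COMMON_SENSE.get(word, "Unknown"))

print(category("mouse"))  # Artifact (overridden)
print(category("clock"))  # Artifact
```

In practice, the automatic lookup handles the bulk of the vocabulary, and only the override table and the remaining unassigned concepts require manual effort.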

Appendix B Distribution of AMR Node Types

Fig. 6 shows the distribution of the AMR role/edge types in our training data. As we can see, keeping only the top types is justified given the skewed distribution of the types. Future work will need to examine the nature of the less frequent relations, and the implications of removing them from AMR graphs.

Figure 6: Frequency of the AMR role/edge types prior to the refinement process, which exhibits the characteristics of a long-tail distribution.

Appendix C Meta-AMR Construction Example

Fig. 7 shows an example of how a meta-AMR is constructed from five caption-level AMRs. The corresponding captions are provided in red, and the AMR graphs are given in PENMAN notation.
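To make the notation concrete, the following toy sketch reads caption AMRs written in PENMAN notation and takes the union of their concept sets, which is the core intuition behind merging caption-level graphs into a meta-AMR. The captions and graphs are invented for illustration, and the real algorithm also merges edges and applies hypernym replacement; the regex here is a simplification sufficient only for these toy inputs.

```python
# Toy illustration of PENMAN notation and the concept-union idea behind
# meta-AMR construction. Not the paper's actual merging algorithm.
import re

def concepts(penman_str):
    # In PENMAN notation every node is written "(var / concept ...)",
    # so concept labels always follow a slash.
    return set(re.findall(r"/\s*([\w-]+)", penman_str))

caption_amrs = [
    # "A giraffe stands in a field." (invented example)
    "(s / stand-01 :ARG1 (g / giraffe) :location (f / field))",
    # "Two giraffes walk in the grass." (invented example)
    "(w / walk-01 :ARG0 (g / giraffe :quant 2) :location (r / grass))",
]

merged = set()
for amr in caption_amrs:
    merged |= concepts(amr)

print(sorted(merged))
# ['field', 'giraffe', 'grass', 'stand-01', 'walk-01']
```

A full implementation would parse the graphs properly (e.g., with a dedicated PENMAN reader) and unify co-referring variables across captions rather than just pooling concept labels.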

Appendix D Ablations

Effect of input on node prediction performance.

Table 3 presents the performance of meta-AMR node prediction (the first stage of Img2Amr) with different input combinations, in terms of Precision and Recall (when predicted and ground-truth nodes are treated as sets) and Bleu-1 (when the order of nodes in the final linearized AMR is taken into consideration). These results suggest that the overall best performance is achieved by using all input features, namely regions, tags, and the global image feature.

Regions Tags Global Recall Precision Bleu-1
- - 34.5 47.1 33.1
- - 30.4 42.8 29.7
- - 30.6 39.9 29.1
- 35.8 49.0 34.3
- 35.1 47.5 33.9
- 32.9 46.5 32.1
36.7 48.4 35.6
Table 3: val performance of meta-AMR node prediction (first stage of Img2Amr) with different input combinations.
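The two evaluation views used in Table 3 can be sketched as follows. This is a simplified re-implementation, not the paper's evaluation script: the set-based view ignores duplicates and order, while the Bleu-1 view is reduced here to clipped unigram precision with a brevity penalty.

```python
# Sketch of the two node-prediction metrics described above, on
# tokenized node lists. Simplified; not the paper's exact scorer.
from collections import Counter
import math

def set_prf(pred, gold):
    """Precision/Recall when predicted and ground-truth nodes are
    treated as sets (duplicates and order ignored)."""
    p, g = set(pred), set(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return precision, recall

def bleu1(pred, gold):
    """Bleu-1 as clipped unigram precision times a brevity penalty,
    computed over the linearized node sequence."""
    if not pred:
        return 0.0
    pc, gc = Counter(pred), Counter(gold)
    clipped = sum(min(c, gc[t]) for t, c in pc.items())
    prec = clipped / len(pred)
    bp = 1.0 if len(pred) >= len(gold) else math.exp(1 - len(gold) / len(pred))
    return bp * prec

pred = ["giraffe", "stand-01", "field", "field"]
gold = ["giraffe", "walk-01", "field"]
print(set_prf(pred, gold))
print(bleu1(pred, gold))
```

Note how the duplicated field node inflates neither set-based precision nor the clipped unigram count, which is why the two views can disagree.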

Effect of input on parsing performance.

We train our Img2Amr models with different encoder inputs and evaluate on the val set. Specifically, the input to the model may contain the global image feature, region embeddings, and tag embeddings (for the first encoder), and node embeddings (for the second encoder of Img2Amr). Table 4 reports the val results of our two models, both trained and tested with meta-AMRs, under different input combinations: (region embeddings, tag embeddings, global image features) for the first model, and (node embeddings, global image features, region embeddings) for the second encoder of the second model. For Img2Amr, we fix the input of the first encoder to the best combination according to Table 3 above, and ablate over the input of the second encoder. As we can see, richer input generally results in better performance. We also see a big drop in the performance of Img2Amr when only region features are used as input, suggesting that tags can help establish mappings between regions and AMR concepts.

Model Input Smatch SemBleu-1 SemBleu-2
{} 30.3 18.6 5.4
{} 39.1 32.9 16.2
{} 39.0 33.7 16.4
{} 39.3 31.3 16.1
{} 39.6 31.9 16.3
{} 40.4 32.6 16.9
Table 4: Ablation over model inputs on val, for both Img2Amr models. For Img2Amr, we use all features (regions, tags, and the global image feature) as the first-encoder input.
Category Example Types #Types (AMR, SG) %Tokens (AMR, SG)
Artifact clock, umbrella, bottle 128 128 22.7 24.4
Part eyes, finger, wing 21 44 3.1 13.1
Location beach, mountain, kitchen 86 52 20.7 11.2
Person man, woman, speaker 30 19 17.9 11.0
Flora/Nature ocean, tree, flower 20 34 6.1 10.2
Clothing dress, scarf, suit 11 31 1.1 7.7
Food orange, donut, bread 52 23 8 2.8
Animal horse, bird, cat 16 20 6.4 4.7
Vehicle car, motorcycle, bicycle 18 17 6.1 4.5
Furniture table, chair, couch 9 10 4.0 2.9
Structure window, tower, circle 13 18 2.1 5.4
Building brick, house, cement 6 6 1.8 2.1
Geometric down, edge, between 48 122 12.4 56.6
Possessive have, wear, contain 5 42 5.9 30.6
Semantic attempt, carry, eat 183 275 38.3 11.6
Attribute Color color, white, blue 13 8 5.6 0.1
Attribute young, small, colorful 82 - 12.8 -
AMR specific and, or, date-entity 8 - 11.1 -
Quantifier more, both, few 31 1 9.3 0.1
Event soccer, party, festival 14 - 3.4 -
Misc they, something, you 6 13 1.1 1.0
Table 5: The list of AMR and SG entity and relation categories, as well as examples of word types, number of types, and percentage of tokens per category.
Figure 7: An example of five caption AMRs and their corresponding meta-AMR. Captions are shown in red.

Appendix E Generated AMRs for the Qualitative Samples

(a) A couple of giraffe standing next to each other in a field near rocks walking in grass in a grassy area.
(b) A yellow and blue fire hydrant on a city street in front at an intersection sitting on the side of the road near a traffic position.
(c) A large long passenger train going across a wooden beach plate, traveling and passing by water.
Figure 8: Images used in Section 5.4, along with their predicted AMRs and generated captions. Refer to Section 5.4 for more details.
(a) A woman sitting at a table eating a sandwich and holding a hot dog in a building smiling while eating.
(b) A white area filled with lots of different kinds of donuts with various toppings sitting on them.
(c) A group of people sitting around at a dining table with water posing for a picture.
Figure 9: (cont) Images used in Section 5.4, along with their predicted AMRs and generated captions. Refer to Section 5.4 for more details.
(a) A person in a red jacket cross country skiing down a snow covered ski slope with a couple of people riding skis and walking on the side of the snowy mountain.
(b) A person in black shirt sitting at a table in a building with a plate of food with and smiling while having meal.
Figure 10: (cont) Images used in Section 5.4, along with their predicted AMRs and generated captions. Refer to Section 5.4 for more details.