1 Introduction
Abstract Meaning Representation (AMR; Banarescu et al., 2013) is a semantic formalism where the meaning of a sentence is encoded as a rooted, directed graph. Figure 1 shows two AMR graphs in which the nodes (such as "girl" and "leave-11") represent AMR concepts and the edges (such as "ARG0" and "ARG1") represent relations between the concepts. The task of parsing sentences into AMRs has received increasing attention, due to the demand for better natural language understanding.
Despite the large amount of work on AMR parsing (Flanigan et al., 2014; Artzi et al., 2015; Pust et al., 2015; Peng et al., 2015; Buys and Blunsom, 2017; Konstas et al., 2017; Wang and Xue, 2017; Ballesteros and Al-Onaizan, 2017; Lyu and Titov, 2018; Peng et al., 2018; Groschwitz et al., 2018; Guo and Lu, 2018), little attention has been paid to evaluating the parsing results, leaving Smatch (Cai and Knight, 2013) as the only overall performance metric. Damonte et al. (2017) developed a suite of fine-grained performance measures based on the node mappings of Smatch (see below).
Smatch suffers from two major drawbacks. First, it is based on greedy hill-climbing to find a one-to-one node mapping between two AMRs (finding the exact best mapping is NP-complete). The resulting search errors weaken its robustness as a metric. To enhance robustness, the hill-climbing search is executed multiple times with random restarts. This decreases efficiency and, more importantly, does not eliminate search errors. Figure 2 shows the means and error bounds of Smatch scores as a function of the number of restarts, over 100 runs on 100 sentences. We can see that the variances remain significant even when the number of restarts is large. Furthermore, by corresponding with other researchers, we have learned that previous papers on AMR parsing report Smatch scores using differing numbers of restarts.

Another problem is that Smatch maps one node to another regardless of their actual content, and it only considers edge labels when comparing two edges. As a result, two different edges, such as "ask-01 :ARG1 leave-11" and "make-01 :ARG1 pie" in Figure 1, can be considered identical by Smatch. This can lead to an overly large score for two completely different AMRs. As shown in Figure 1, Smatch gives a score of 25% for the two AMRs meaning "The girl asked the boy to leave" and "The woman made two pies", which convey obviously different meanings (https://amr.isi.edu/eval/smatch/compare.html gives more details). The situation can be worse for two different AMRs with few types of edge labels, where the score could reach 50% if all pairs of edges between them were accidentally matched.
To tackle the problems above, we introduce SemBleu, an accurate metric for comparing AMR graphs. SemBleu extends Bleu (Papineni et al., 2002), which has been shown to be effective for evaluating a wide range of text generation tasks, such as machine translation and data-to-text generation. In general, a Bleu score is a precision score calculated by comparing the n-grams (n up to 4) of a predicted sentence to those of a reference sentence. To punish very short predictions, it is multiplied by a brevity penalty, which is less than 1.0 for a prediction shorter than its reference. To adapt Bleu for comparing AMRs, we treat each AMR node (such as "ask-01") as a unigram, and we take each pair of directly connected AMR nodes with their relation (such as "ask-01 :ARG0 girl") as a bigram. Higher-order n-grams (such as "ask-01 :ARG1 leave-11 :ARG0 boy") are defined in a similar way.

SemBleu has several advantages over Smatch. First, it gives an exact score for each pair of AMRs, without search errors. Second, it is very efficient to calculate: on a dataset of 1368 pairs of AMRs, SemBleu takes 0.5 seconds, while Smatch takes almost 2 minutes on the same machine. Third, it captures high-order relations in addition to node-to-node and edge-to-edge correspondences, providing judgments complementary to the standard Smatch metric for evaluating AMR parsing quality. Last, it does not give overly large credit to AMRs that represent completely different meanings.
Our initial evaluations suggest that SemBleu is more consistent with human judgments than Smatch in both corpus-level and sentence-level evaluations. We also show that the number of n-grams extracted by SemBleu is roughly linear in the AMR size. Evaluation on the outputs of several recent models shows that SemBleu is mostly consistent with Smatch for ranking results, with occasional disagreements.
2 Our metric
Our method is based on Bleu, which we briefly introduce before showing how we extend it to match AMR graphs.
2.1 Preliminary knowledge on Bleu
As shown in Equation 1, the Bleu score for a candidate $c$ and a reference $r$ is calculated by multiplying a modified $n$-gram precision with a brevity penalty ($BP$):

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{1}$$

$BP$ is defined as $\min(1,\, e^{\,1 - |r|/|c|})$, which gives a value less than 1.0 when the candidate length $|c|$ is smaller than the reference length $|r|$. $p_n$ and $w_n$ are the precision and weight for matching $n$-grams, and $p_n$ is defined as

$$p_n = \frac{|NG_n(c) \cap NG_n(r)|}{|NG_n(c)|} \tag{2}$$

where $NG_n$ is the function that extracts all $n$-grams (as a multiset) from its input.
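As a concrete illustration, the computation in Equations 1 and 2 can be sketched in Python for plain token sequences. This is a minimal, unsmoothed sketch (the function names are ours, not from any particular library):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """p_n: clipped n-gram matches divided by the candidate n-gram count."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / max(1, sum(cand_counts.values()))

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n), uniform weights w_n = 1/max_n."""
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if any(p == 0 for p in precisions):  # unsmoothed: any zero p_n => 0
        return 0.0
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))
```

An identical candidate and reference score 1.0; a truncated candidate is penalized by both lower precisions and the brevity penalty.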
2.2 SemBleu
To introduce SemBleu, we make the following changes to adapt Bleu to AMR graphs. First, we define the size of an AMR $G=(V,E)$ as the number of nodes plus the number of edges, $|G| = |V| + |E|$; this size is used to calculate the brevity penalty ($BP$). Intuitively, edges carry important relational information. We also observed many situations where a system-generated AMR preserves most of the concepts in the reference but misses many edges.
Another change is to the n-gram extraction function ($NG_n$ in Equation 2). AMRs are directed acyclic graphs, so we start extracting n-grams from the roots, analogous to extracting plain n-grams from the left endpoint of a sentence. Note that the order of each n-gram is determined only by the number of nodes within it. For instance, "ask-01 :ARG0 girl" is considered a bigram, not a trigram.
Our n-gram extraction method adopts breadth-first traversal to enumerate every possible starting node. From each starting node $v$, it extracts all possible $n$-grams ($n \le N$) beginning at $v$: at each node, it first stores the current n-gram before enumerating every descendant of the node and moving on. Taking the AMR graphs in Figure 1 as examples, the n-grams extracted by our method are shown in Table 1.
| Fig. | $n$ | Extracted $n$-grams |
|------|-----|---------------------|
| (a)  | 1   | ask-01; girl; leave-11; boy |
|      | 2   | ask-01 :ARG0 girl; ask-01 :ARG1 leave-11; leave-11 :ARG0 boy |
|      | 3   | ask-01 :ARG1 leave-11 :ARG0 boy |
| (b)  | 1   | woman; make-01; pie; 2 |
|      | 2   | make-01 :ARG0 woman; make-01 :ARG1 pie; pie :quant 2 |
|      | 3   | make-01 :ARG1 pie :quant 2 |

Table 1: $n$-grams extracted from the two AMRs in Figure 1.
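To make the extraction procedure concrete, the following sketch reproduces the n-grams in Table 1. It assumes a simple dictionary representation of an AMR (node label mapped to a list of (relation, child) pairs, with unique node labels); this representation is our own simplification, not the authors' code:

```python
def extract_ngrams(graph, max_nodes=3):
    """Extract all path n-grams with up to `max_nodes` nodes.

    `graph` maps a node label to a list of (relation, child) pairs;
    node labels are assumed unique here for simplicity. Each gram is
    a tuple alternating node and relation labels.
    """
    grams = []

    def expand(path, node, depth):
        grams.append(tuple(path))          # store the current gram first
        if depth == max_nodes:
            return
        for rel, child in graph.get(node, []):
            expand(path + [rel, child], child, depth + 1)

    for start in graph:                    # every node can start a gram
        expand([start], start, 1)
    return grams
```

On the graph of Figure 1(a) this yields the four unigrams, three bigrams and one trigram listed in Table 1(a); the enumeration order of starting nodes does not affect the resulting multiset.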
| Metric  | CAMR vs JAMR | CAMR vs Gros | CAMR vs Lyu | JAMR vs Gros | JAMR vs Lyu | Gros vs Lyu |
|---------|--------------|--------------|-------------|--------------|-------------|-------------|
| Smatch  | 67.9         | 99.9         | 100.0       | 100.0        | 100.0       | 90.3        |
| SemBleu | 69.0         | 99.9         | 100.0       | 100.0        | 100.0       | 90.9        |

Table 2: Corpus-level consistency accuracies (%) of both metrics with human judgments, over bootstrap-resampled datasets.
Processing inverse relations One important characteristic of AMR is its inverse relations, such as "ask-01 :ARG0 girl" versus "girl :ARG0-of ask-01", which preserve the properties of being rooted and acyclic. Both the original and the inverse relation carry the same semantic meaning. Following Smatch, we unify the two types by reverting all inverse relations to their original form before calculating SemBleu scores.
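This normalization can be sketched on edge triples as follows (our own simplification; a real implementation may need to special-case relations such as ":consist-of" that end in "-of" without being inverses):

```python
def normalize_edges(triples):
    """Revert inverse relations: (head, ":REL-of", tail) -> (tail, ":REL", head).

    Simplified sketch: treats every relation ending in "-of" as an inverse.
    """
    out = []
    for head, rel, tail in triples:
        if rel.endswith("-of"):
            out.append((tail, rel[:-3], head))  # strip "-of", swap endpoints
        else:
            out.append((head, rel, tail))
    return out
```

After this step, the two surface forms of the same relation contribute identical n-grams.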
Efficiency As an important factor, the efficiency of SemBleu largely depends on the number of extracted n-grams. One potential problem is that very dense graphs can yield a huge number of n-grams: for a fully connected graph with $N$ nodes, there are $N(N-1)\cdots(N-k+1) = O(N^k)$ possible $k$-grams. Luckily, AMRs are tree-like graphs (Chiang et al., 2018) that are very sparse. For a tree with $N$ nodes, each node terminates at most one downward path of each order, so the number of n-grams up to order $k$ is bounded by $kN$, which is linear in the tree size. Since AMRs are tree-like, we expect the number of n-grams extracted from them to be roughly linear in the graph size. Our experiments empirically confirm this expectation.
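The contrast between the two bounds can be checked numerically. This is a sketch of the closed forms following from the argument above (the function names are ours):

```python
from math import perm

def complete_graph_kgrams(num_nodes, k):
    """k-node path n-grams in a complete directed graph: ordered
    selections of k distinct nodes, i.e. N! / (N - k)!."""
    return perm(num_nodes, k)

def tree_kgrams_upper_bound(num_nodes, k):
    """In a tree, each node terminates at most one downward path per
    order, so the total over orders 1..k is at most k * num_nodes."""
    return k * num_nodes
```

For a 10-node graph with $k = 3$, the dense case already allows 720 trigram paths, while a tree contributes at most 30 n-grams in total.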
2.3 Comparison with Smatch
In general, Smatch breaks the problem of comparing two AMRs down into comparing the smallest units: nodes and edges. It treats each AMR as a bag of nodes and edges, and calculates an F1 score over the correctly mapped nodes and edges. Given two AMRs, Smatch searches for a one-to-one mapping between the graph nodes that maximizes the overall F1 score; the edge-to-edge mappings are automatically determined by the node-to-node mappings. Since obtaining the optimal mapping is NP-complete (by reduction from subgraph isomorphism), Smatch uses a greedy hill-climbing algorithm to find a mapping, which is likely to be suboptimal.
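For intuition, a toy version of such a search might look as follows. This is our own simplification, not the Smatch code: it assumes equal-size graphs, counts raw matches instead of computing F1, and uses only pairwise-swap moves:

```python
import random

def count_matches(mapping, nodes_a, edges_a, nodes_b, edges_b):
    """Matched units under a node mapping: equal node labels, plus edges
    whose relation and mapped endpoints also appear in graph B."""
    score = sum(1 for i, j in mapping.items() if nodes_a[i] == nodes_b[j])
    b_edges = set(edges_b)
    score += sum(1 for h, r, t in edges_a
                 if (mapping[h], r, mapping[t]) in b_edges)
    return score

def hillclimb_mapping(nodes_a, edges_a, nodes_b, edges_b, restarts=4, seed=0):
    """Random-restart greedy hill-climbing over one-to-one node mappings.
    Each restart begins from a random bijection and keeps applying
    improving swaps until none exists; the result may be suboptimal."""
    assert len(nodes_a) == len(nodes_b)  # simplification for this sketch
    rng = random.Random(seed)
    best_map, best_score = None, -1
    for _ in range(restarts):
        perm = list(range(len(nodes_b)))
        rng.shuffle(perm)
        mapping = dict(enumerate(perm))
        score = count_matches(mapping, nodes_a, edges_a, nodes_b, edges_b)
        improved = True
        while improved:
            improved = False
            for i in range(len(nodes_a)):
                for j in range(i + 1, len(nodes_a)):
                    mapping[i], mapping[j] = mapping[j], mapping[i]
                    new = count_matches(mapping, nodes_a, edges_a,
                                        nodes_b, edges_b)
                    if new > score:
                        score, improved = new, True
                    else:  # undo a non-improving swap
                        mapping[i], mapping[j] = mapping[j], mapping[i]
        if score > best_score:
            best_map, best_score = dict(mapping), score
    return best_map, best_score
```

Even on tiny graphs, whether the global optimum is reached depends on the random starts, which is exactly the robustness issue discussed in the introduction.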
One key difference is that SemBleu generally considers more global features than Smatch. The only features the two metrics have in common are the node-to-node correspondences (unigrams in SemBleu). Each bigram of SemBleu consists of two AMR nodes and the edge connecting them, so bigrams already capture larger contexts than Smatch does, and the higher-order n-grams of SemBleu capture even larger correspondences. This can be a trade-off: generally, more high-order matches indicate better parsing performance, but sometimes we want to give partial credit to distinguish partially correct results from fully wrong ones. As a result, combining Smatch with SemBleu may give a more comprehensive judgment.
Another difference is the way edge (relation) equivalence is determined. Smatch only checks edge labels, so two edges with the same label but different meanings can be considered equivalent by Smatch (one example is shown in the Smatch tutorial at https://amr.isi.edu/eval/smatch/tutorial.html). SemBleu, on the other hand, considers not only the edge labels but also the content of their head and tail nodes, as shown by the extracted n-grams in Table 1.
Taking the AMRs in Figure 1 as an example, Smatch maps "girl", "ask-01" and "leave-11" in (a) to "woman", "make-01" and "pie" in (b). As a result, it considers "ask-01 :ARG0 girl" and "ask-01 :ARG1 leave-11" in (a) to be correctly mapped to "make-01 :ARG0 woman" and "make-01 :ARG1 pie" in (b), which does not make sense. SemBleu, by contrast, does not consider these edges correctly matched.
3 Experiments
We compare SemBleu with Smatch on the outputs of 4 systems over 100 sentences from the test set of LDC2015E86. The systems are CAMR (https://github.com/c-amr/camr), JAMR (https://github.com/jflanigan/jamr), Gros (Groschwitz et al., 2018) and Lyu (Lyu and Titov, 2018). For each sentence, following Callison-Burch et al. (2010), annotators decide relative orders instead of a complete order over all systems. In particular, the 4 system outputs are randomly grouped into 2 pairs, yielding 2 comparisons. For each pair, we ask 3 annotators to decide which output is better and take the majority vote as the final judgment. All annotators have several years of experience in AMR-related research, and their judgments are based on how well a system-generated AMR retains the meaning of the reference AMR. Out of the 200 comparisons, the annotators fully agree on 142 (71%). With these judgments, we study the consistency of both metrics at the sentence and corpus levels.
We consider all unigrams, bigrams and trigrams for SemBleu, with equal weights ($w_n$ in Equation 1; 1/3 each). For sentence-level evaluation, we follow previous work in using NIST geometric smoothing (Chen and Cherry, 2014). Following Smatch, inverse relations such as ":ARG0-of" are reverted before extracting n-grams for comparison.
3.1 Corpuslevel experiment
For the corpus-level comparison, we assign each system a human score equal to the number of times that system's output was preferred.
Our four systems achieved human scores of 30, 33, 63 and 74. They achieved SemBleu scores of 28, 30, 38 and 41, respectively, and Smatch scores of 56, 56, 63 and 67, respectively. SemBleu is generally more consistent with the human judgments; in particular, Smatch scores CAMR and JAMR as a tie, while the SemBleu scores are more discriminating. We use the script-default 2 significant digits when calculating Smatch scores, as their variance can be very large (Figure 2). For a fair comparison, we also use 2 significant digits for SemBleu.
Bootstrap tests To conduct a more comprehensive comparison, we use bootstrap resampling (Koehn, 2004) to obtain 1000 new datasets, each with 100 instances. Every dataset contains the references, the 4 system outputs and the corresponding human scores. Using the new datasets, we check how frequently SemBleu and Smatch are consistent with the human judgments at the corpus level, as a form of significance testing.
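A sketch of this procedure for one system pair follows. It is our own simplification of the setup: per-instance metric scores are summed into corpus scores, and each instance carries a +1/−1 human preference:

```python
import random

def bootstrap_agreement(metric_a, metric_b, human_pref, trials=1000, seed=0):
    """Fraction of bootstrap replicates on which the metric's corpus-level
    ranking of systems A and B matches the human ranking.

    metric_a / metric_b: per-instance metric scores for the two systems.
    human_pref: per-instance human preference (+1: A better, -1: B better).
    """
    rng = random.Random(seed)
    n = len(human_pref)
    agree = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        metric_diff = sum(metric_a[i] - metric_b[i] for i in idx)
        human_diff = sum(human_pref[i] for i in idx)
        if metric_diff * human_diff > 0:  # same ranking direction
            agree += 1
    return agree / trials
```

The accuracies in Table 2 are of this form, computed for every pair of the four systems.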
Table 2 shows the accuracies of both metrics across all 6 system pairs (such as CAMR vs Lyu). Overall, SemBleu is equal to or slightly better than Smatch on every system pair. The advantages are not statistically significant, perhaps because of the small data size; collecting human judgments on large-scale data is very time-consuming. The accuracies of both metrics on CAMR vs JAMR are lower than on the other system pairs, likely because the gaps between these two systems in both human and metric scores are much smaller. Still, SemBleu is better than Smatch on this pair, suggesting that it may be more consistent with human evaluation.
3.2 Sentencelevel experiment
| Metric            | Percent (%) |
|-------------------|-------------|
| Smatch            | 76.5        |
| SemBleu           | 81.5        |
| SemBleu ($k=1$)   | 69.5        |
| SemBleu ($k=2$)   | 78.0        |
| SemBleu ($k=4$)   | 80.0        |

Table 3: Sentence-level consistency (%) with human judgments; $k$ is the maximum n-gram order.
For the sentence-level comparison, we calculate the frequency with which a metric is consistent with the human judgment on a pair of outputs. Recall that we make two pairs out of the 4 outputs for each sentence, so there are 200 pairs in total.
As shown in the upper group of Table 3, SemBleu is 5.0 points better than Smatch, meaning that it makes 10 more correct evaluations than Smatch over the 200 instances. This indicates that SemBleu is more consistent with human judgments than Smatch. The lower group shows SemBleu accuracies with different maximum n-gram orders $k$. With only unigram features (node-to-node correspondences), SemBleu is much worse than Smatch. When bigrams and trigrams are incorporated, SemBleu gives consistently better numbers, demonstrating the usefulness of high-order features. Further increasing $k$ leads to a decrease in accuracy, likely because humans care more about whole-graph quality than about occasional high-order matches.
3.3 Analysis on gram numbers
Figure 3 shows the number of extracted n-grams as a function of the number of AMR nodes on the devset of LDC2015E86, which has 1368 instances. The number of extracted unigrams is exactly the number of AMR nodes, as expected. The data points become less concentrated from bigrams to trigrams, because the number of n-grams depends not only on the graph size but also on how dense the graph is. Overall, the number of extracted n-grams is roughly linear in the number of nodes in the graph.
3.4 Evaluating with SemBleu
| Data        | Model    | SemBleu | Smatch |
|-------------|----------|---------|--------|
| LDC2015E86  | Lyu      | 52.7    | 73.7   |
|             | Guo      | 50.1    | 68.7   |
|             | Gros     | 50.0    | 70.2   |
|             | JAMR     | 46.8    | 67.0   |
|             | CAMR     | 37.2    | 62.0   |
| LDC2016E25  | Lyu      | 54.3    | 74.4   |
|             | van Nood | 49.2    | 71.0   |
| LDC2017T10  | Guo      | 52.0    | 69.8   |
|             | Gros     | 50.7    | 71.0   |
|             | JAMR     | 47.0    | 66.0   |
|             | CAMR     | 36.6    | 61.0   |

Table 4: SemBleu and Smatch scores of several recent models.
Table 4 shows the SemBleu and Smatch scores of several recent models. In particular, we requested the outputs of Lyu (Lyu and Titov, 2018), Gros (Groschwitz et al., 2018), van Nood (van Noord and Bos, 2017) and Guo (Guo and Lu, 2018) in order to evaluate them with SemBleu. For CAMR and JAMR, we obtained outputs by running the released systems. SemBleu is mostly consistent with Smatch, except for the relative order of Guo and Gros, probably because Guo has more high-order correspondences with the reference.

4 Conclusion
While one might expect a trade-off between speed and correlation with human judgments, SemBleu appears to outperform Smatch in both dimensions. The improvement in correlation with human judgments comes from the fact that SemBleu considers larger fragments of the input graphs. The improvement in speed comes from avoiding the search over node mappings between the two graphs: in practice, vertex correspondences are identified simply through the vertex labels and the labels of their neighbors, via the n-grams in which they appear. SemBleu can potentially be used to compare other types of graphs, including cyclic graphs.
Acknowledgments
We are very grateful to Lisa Jin and Parker Riley for making annotations. We thank Zhiguo Wang (Amazon AWS), Jinsong Su (Xiamen University) and the anonymous reviewers for their insightful comments. Research supported by NSF award IIS-1813823.
References

Artzi et al. (2015) Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1699–1710.
 Ballesteros and Al-Onaizan (2017) Miguel Ballesteros and Yaser Al-Onaizan. 2017. AMR parsing using stack-LSTMs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1269–1275.
 Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.
 Buys and Blunsom (2017) Jan Buys and Phil Blunsom. 2017. Robust incremental neural semantic graph parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1215–1226.
 Cai and Knight (2013) Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748–752.
 Callison-Burch et al. (2010) Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53.
 Chen and Cherry (2014) Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 362–367.
 Chiang et al. (2018) David Chiang, Frank Drewes, Daniel Gildea, Adam Lopez, and Giorgio Satta. 2018. Weighted DAG automata for semantic graphs. Computational Linguistics, 44(1):119–186.
 Damonte et al. (2017) Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for abstract meaning representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 536–546.
 Flanigan et al. (2014) Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the abstract meaning representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436.
 Groschwitz et al. (2018) Jonas Groschwitz, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1831–1841.
 Guo and Lu (2018) Zhijiang Guo and Wei Lu. 2018. Better transition-based AMR parsing with a refined search space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1712–1722.
 Koehn (2004) Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395.
 Konstas et al. (2017) Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157.
 Lyu and Titov (2018) Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407.
 van Noord and Bos (2017) Rik van Noord and Johan Bos. 2017. Neural semantic parsing by character-based translation: Experiments with abstract meaning representations. Computational Linguistics in the Netherlands (CLIN), 7:93–108.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
 Peng et al. (2015) Xiaochang Peng, Linfeng Song, and Daniel Gildea. 2015. A synchronous hyperedge replacement grammar based approach for AMR parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 32–41.
 Peng et al. (2018) Xiaochang Peng, Linfeng Song, Daniel Gildea, and Giorgio Satta. 2018. Sequencetosequence models for cache transition systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1842–1852.
 Pust et al. (2015) Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. 2015. Parsing English into abstract meaning representation using syntax-based machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1143–1154.
 Wang and Xue (2017) Chuan Wang and Nianwen Xue. 2017. Getting the most out of AMR parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1257–1268.