1 Introduction
Abstract meaning representations (AMRs) Banarescu et al. (2013) are broad-coverage sentence-level semantic representations. AMR encodes, among other things, information about semantic relations, named entities, coreference, negation and modality.
The semantic representations can be regarded as rooted labeled directed acyclic graphs (see Figure 1). As AMR abstracts away from details of surface realization, it is potentially beneficial in many semantics-related NLP tasks, including text summarization Liu et al. (2015); Dohare and Karnick (2017), machine translation Jones et al. (2012) and question answering Mitra and Baral (2016).
AMR parsing has recently received a lot of attention (e.g., Flanigan et al. (2014); Artzi et al. (2015); Konstas et al. (2017)). One distinctive aspect of AMR annotation is the lack of explicit alignments between nodes in the graph (concepts) and words in the sentences. Though this arguably simplified the annotation process Banarescu et al. (2013), it is not straightforward to produce an effective parser without relying on an alignment. Most AMR parsers Damonte et al. (2017); Flanigan et al. (2016); Werling et al. (2015); Wang and Xue (2017); Foland and Martin (2017) use a pipeline where the aligner training stage precedes training a parser. The aligners are not directly informed by the AMR parsing objective and may produce alignments that are suboptimal for this task.
In this work, we demonstrate that the alignments can be treated as latent variables in a joint probabilistic model and induced in such a way as to be beneficial for AMR parsing. Intuitively, in our probabilistic model, every node in a graph is assumed to be aligned to a word in a sentence: each concept is predicted based on the corresponding RNN state. Similarly, graph edges (i.e. relations) are predicted based on representations of concepts and aligned words (see Figure 2). As alignments are latent, exact inference requires marginalizing over latent alignments, which is infeasible. Instead we use variational inference, specifically the variational autoencoding framework of Kingma and Welling (2014). Using discrete latent variables in deep learning has proven to be challenging Mnih and Gregor (2014); Bornschein and Bengio (2015). We use a continuous relaxation of the alignment problem, relying on the recently introduced Gumbel-Sinkhorn construction Mena et al. (2018). This yields a computationally efficient approximate method for estimating our joint probabilistic model of concepts, relations and alignments.
We assume injective alignments from concepts to words: every node in the graph is aligned to a single word in the sentence and every word is aligned to at most one node in the graph. This is necessary for two reasons. First, it lets us treat concept identification as sequence tagging at test time. For every word we would simply predict the corresponding concept or predict NULL to signify that no concept should be generated at this position. Second, Gumbel-Sinkhorn can only work under this assumption. This constraint, though often appropriate, is problematic for certain AMR constructions (e.g., named entities). In order to deal with these cases, we recategorized AMR concepts. Similar recategorization strategies have been used in previous work Foland and Martin (2017); Peng et al. (2017).
The resulting parser achieves 74.4% Smatch score on the standard test set when using the LDC2016E25 training set (the standard deviation across multiple training runs was 0.16%), an improvement of 3.4% over the previous best result van Noord and Bos (2017). We also demonstrate that inducing alignments within the joint model is indeed beneficial. When, instead of inducing alignments, we follow the standard approach and produce them during preprocessing, the performance drops by 0.9% Smatch. Our main contributions can be summarized as follows:
we introduce a joint probabilistic model for alignment, concept and relation identification;

we demonstrate that a continuous relaxation can be used to effectively estimate the model;

the model achieves the best reported results. (The code can be accessed from https://github.com/ChunchuanLv/AMR_AS_GRAPH_PREDICTION)
2 Probabilistic Model
In this section we describe our probabilistic model and the estimation technique. In section 3, we describe preprocessing and postprocessing (including concept recategorization, sense disambiguation, wikification and root selection).
2.1 Notation and setting
We will use the following notation throughout the paper. We refer to words in the sentences as $\mathbf{w} = (w_1, \ldots, w_n)$, where $n$ is sentence length, $w_k \in \mathcal{V}$ for $k \in \{1, \ldots, n\}$. The concepts (i.e. labeled nodes) are $\mathbf{c} = (c_1, \ldots, c_m)$, where $m$ is the number of concepts and $c_i \in \mathcal{C}$ for $i \in \{1, \ldots, m\}$ (see Figure 1 for an example). (The probabilistic model is invariant to the ordering of concepts, though the order affects the inference algorithm (see Section 2.5). We use depth-first traversal of the graph to generate the ordering.) Note that senses are predicted at postprocessing, as discussed in Section 3.2 (i.e. go is labeled as go-02).
A relation between ‘predicate concept’ $c_i$ and ‘argument concept’ $c_j$ is denoted by $r_{ij} \in \mathcal{R}$; it is set to NULL if $c_j$ is not an argument of $c_i$. We will use $R$ to denote all relations in the graph.
To represent alignments, we will use $\mathbf{a} = (a_1, \ldots, a_m)$, where $a_i \in \{1, \ldots, n\}$ returns the index of a word aligned to concept $c_i$.
All three model components rely on bidirectional LSTM encoders Schuster and Paliwal (1997). We denote states of the BiLSTM (i.e. concatenation of forward and backward LSTM states) as $\mathbf{h}_k \in \mathbb{R}^d$ ($k \in \{1, \ldots, n\}$). The sentence encoder takes pre-trained fixed word embeddings, randomly initialized lemma embeddings, part-of-speech and named-entity tag embeddings.
2.2 Method overview
We believe that using discrete alignments, rather than attention-based models Bahdanau et al. (2015), is crucial for AMR parsing. AMR banks are a lot smaller than parallel corpora used in machine translation (MT) and hence it is important to inject a useful inductive bias. We constrain our alignments from concepts to words to be injective. First, it encodes the observation that concepts are mostly triggered by single words (especially after recategorization, Section 3.1). Second, it implies that each word corresponds to at most one concept (if any). This encourages competition: alignments are mutually repulsive. In our example, obligate is not lexically similar to the word must and may be hard to align. However, given that other concepts are easy to predict, alignment candidates other than must and the will be immediately ruled out. We believe that these are the key reasons why attention-based neural models do not achieve competitive results on AMR Konstas et al. (2017) and why state-of-the-art models rely on aligners. Our goal is to combine the best of both worlds: to use alignments (as in state-of-the-art AMR methods) and to induce them while optimizing for the end goal (similarly to the attention component of encoder-decoder models).
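Our model consists of three parts: (1) the concept identification model $P_\theta(\mathbf{c} | \mathbf{a}, \mathbf{w})$; (2) the relation identification model $P_\phi(R | \mathbf{a}, \mathbf{w}, \mathbf{c})$ and (3) the alignment model $Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w})$ ($\theta$, $\phi$ and $\psi$ denote all parameters of the models). Formally, (1) and (2) together with the uniform prior over alignments $P(\mathbf{a})$ form the generative model of AMR graphs. In contrast, the alignment model $Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w})$, as will be explained below, is approximating the intractable posterior $P_{\theta,\phi}(\mathbf{a} | \mathbf{c}, R, \mathbf{w})$ within that probabilistic model.
In other words, we assume the following model for generating the AMR graph:
$$P_{\theta,\phi}(\mathbf{c}, R | \mathbf{w}) = \sum_{\mathbf{a}} P(\mathbf{a}) \, P_\theta(\mathbf{c} | \mathbf{a}, \mathbf{w}) \, P_\phi(R | \mathbf{a}, \mathbf{w}, \mathbf{c})$$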
AMR concepts are assumed to be generated conditionally independently, relying on the BiLSTM states and surface forms of the aligned words. Similarly, relations are predicted based only on AMR concept embeddings and LSTM states corresponding to words aligned to the involved concepts. Their combined representations are fed into a bi-affine classifier Dozat and Manning (2017) (see Figure 2).
The expression involves intractable marginalization over all valid alignments. As is standard in variational autoencoders (VAEs) Kingma and Welling (2014), we lower-bound the log-likelihood as
$$\log P_{\theta,\phi}(\mathbf{c}, R | \mathbf{w}) \geq E_Q\left[\log P_\theta(\mathbf{c} | \mathbf{a}, \mathbf{w}) \, P_\phi(R | \mathbf{a}, \mathbf{w}, \mathbf{c})\right] - D_{KL}\left(Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w}) \,\|\, P(\mathbf{a})\right)$$ (1)
where $Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w})$ is the variational posterior (aka the inference network), $E_Q[\ldots]$ refers to the expectation under $Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w})$ and $D_{KL}$ is the Kullback-Leibler divergence. In VAEs, the lower bound is maximized both with respect to model parameters ($\theta$ and $\phi$ in our case) and the parameters of the inference network ($\psi$). Unfortunately, gradient-based optimization with discrete latent variables is challenging. We use a continuous relaxation of our optimization problem, where real-valued vectors $\hat{\mathbf{a}}_i \in \mathbb{R}^n$ (for every concept $c_i$) approximate discrete alignment variables $a_i$. This relaxation results in low-variance estimates of the gradient using the reparameterization trick Kingma and Welling (2014), and ensures fast and stable training. We will describe the model components and the relaxed inference procedure in detail in Sections 2.6 and 2.7.
Though the estimation procedure requires the use of the relaxation, the learned parser is straightforward to use. Given our assumptions about the alignments, we can independently choose for each word $w_k$ ($k = 1, \ldots, n$) the most probable concept according to $P_\theta(c | \mathbf{h}_k)$. If the highest scoring option is NULL, no concept is introduced. The relations could then be predicted relying on $P_\phi(R | \mathbf{a}, \mathbf{w}, \mathbf{c})$. This would have led to generating inconsistent AMR graphs, so instead we search for the highest scoring valid graph (see Section 3.2). Note that the alignment model $Q_\psi$ is not used at test time and is only necessary to train accurate concept and relation identification models.
2.3 Concept identification model
The concept identification model chooses a concept $c$ (i.e. a labeled node) conditioned on the aligned word $k$ or decides that no concept should be introduced (i.e. returns NULL). Though it can be modeled with a softmax classifier, it would not be effective in handling rare or unseen words. First, we split the decision into estimating the probability of the concept category $f$ (e.g. ‘number’, ‘frame’) and estimating the probability of the specific concept within the chosen category. Second, based on a lemmatizer and training data (see supplementary materials) we prepare one candidate concept for each word in the vocabulary (e.g., it would propose want if the word is wants). Similar to Luong et al. (2015), our model can then either copy the candidate or rely on the softmax over potential concepts of category $f$. Formally, writing $\mathrm{copy}(w_k)$ for the candidate concept of word $w_k$, the concept prediction model is defined as
$$P_\theta(c | \mathbf{h}_k, w_k) = P(f | \mathbf{h}_k) \times \frac{[[\mathrm{copy}(w_k) = c]] \exp(\mathbf{v}_{f}^T \mathbf{h}_k) + \exp(\mathbf{u}_{c}^T \mathbf{h}_k)}{Z(\mathbf{h}_k, \theta)}$$
where the first multiplicative term is a softmax classifier over categories (including NULL); $\mathbf{v}_f, \mathbf{u}_c \in \mathbb{R}^d$ (for $c \in \mathcal{C}$) are model parameters; $[[\ldots]]$ denotes the indicator function and equals 1 if its argument is true and 0 otherwise; $Z(\mathbf{h}_k, \theta)$ is the partition function ensuring that the scores sum to 1.
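As a rough illustration of the copy-or-generate decision described above, the following sketch computes a distribution over (category, concept) pairs for a single word. It is not the released implementation: the function name `concept_distribution`, the dictionary-based score inputs and the single scalar `copy_score` are assumptions made for the example, and the scores stand in for the dot products with the BiLSTM state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concept_distribution(category_scores, concept_scores, copy_score, candidate):
    """Return {(category, concept): probability} for one word.

    category_scores: {category: score}; must contain "NULL"
    concept_scores:  {category: {concept: score}} for non-NULL categories
    copy_score:      score for copying the lemma-based candidate (assumed scalar)
    candidate:       (category, concept) proposed for this word, or None
    """
    cats = list(category_scores)
    cat_probs = dict(zip(cats, softmax([category_scores[c] for c in cats])))
    dist = {("NULL", "NULL"): cat_probs["NULL"]}
    for cat, scores in concept_scores.items():
        # Unnormalized mass from the vocabulary softmax within this category.
        mass = {c: math.exp(s) for c, s in scores.items()}
        # The candidate concept receives extra mass from the copy term,
        # even if it never appeared in the category vocabulary.
        if candidate is not None and candidate[0] == cat:
            mass[candidate[1]] = mass.get(candidate[1], 0.0) + math.exp(copy_score)
        z = sum(mass.values())  # partition function shared by both terms
        for c, m in mass.items():
            dist[(cat, c)] = cat_probs[cat] * m / z
    return dist
```

The point of the shared partition function is that an unseen concept such as the lemma of a rare word can still receive probability mass through the copy term.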
2.4 Relation identification model
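We use the following arc-factored relation identification model:
$$P_\phi(R | \mathbf{a}, \mathbf{w}, \mathbf{c}) = \prod_{i,j=1}^{m} P(r_{ij} | \mathbf{h}_{a_i}, \mathbf{c}_i, \mathbf{h}_{a_j}, \mathbf{c}_j)$$ (2)
Each term is modeled in exactly the same way:

for both endpoints, the embedding of the concept $\mathbf{c}$ is concatenated with the RNN state $\mathbf{h}$;

they are linearly projected to a lower dimension separately through $M_h(\mathbf{h}_{a_i} \circ \mathbf{c}_i) \in \mathbb{R}^{d_f}$ and $M_d(\mathbf{h}_{a_j} \circ \mathbf{c}_j) \in \mathbb{R}^{d_f}$, where $\circ$ denotes concatenation;

a log-linear model with bilinear scores $M_h(\mathbf{h}_{a_i} \circ \mathbf{c}_i)^T C_r M_d(\mathbf{h}_{a_j} \circ \mathbf{c}_j)$, $C_r \in \mathbb{R}^{d_f \times d_f}$, is used to compute the probabilities.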
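The three steps above can be sketched in a few lines of plain Python. This is only an illustration under assumed shapes: the names `M_head`, `M_dep` and the per-relation matrices in `C` are hypothetical, vectors are Python lists, and no bias terms are included.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def relation_distribution(h_i, c_i, h_j, c_j, M_head, M_dep, C):
    """Return {relation: probability} for head concept i and dependent concept j.

    h_*: BiLSTM state of the aligned word; c_*: concept embedding
    M_head, M_dep: separate projection matrices for the two endpoints
    C: {relation: square matrix of bilinear weights}; should include "NULL"
    """
    head = matvec(M_head, h_i + c_i)  # list concatenation, then projection
    dep = matvec(M_dep, h_j + c_j)
    # Bilinear score head^T C_r dep for every relation label r.
    scores = {r: sum(x * y for x, y in zip(head, matvec(C_r, dep)))
              for r, C_r in C.items()}
    top = max(scores.values())
    exps = {r: math.exp(s - top) for r, s in scores.items()}
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}
```

Using separate projections for the head and the dependent lets the same concept behave differently as a predicate and as an argument.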
In the above discussion, we assumed that the BiLSTM encodes a sentence once and the BiLSTM states are then used to predict concepts and relations. In semantic role labeling, a task closely related to the relation identification stage of AMR parsing, a slight modification of this approach was shown to be more effective Zhou and Xu (2015); Marcheggiani et al. (2017). In that previous work, the sentence was encoded by a BiLSTM once for each predicate (i.e. verb) and the encoding was in turn used to identify arguments of that predicate. The only difference across the re-encoding passes was a binary flag used as input to the BiLSTM encoder at each word position. The flag was set to 1 for the word corresponding to the predicate and to 0 for all other words. In that way, the BiLSTM was encoding the sentence specifically for predicting arguments of a given predicate. Inspired by this approach, when predicting the label $r_{ij}$ for $j \in \{1, \ldots, m\}$, we input binary flags $p_1, \ldots, p_n$ to the BiLSTM encoder, which are set to 1 for the word indexed by $a_i$ ($p_{a_i} = 1$) and to 0 for other words ($p_k = 0$, for $k \neq a_i$). This also means that the BiLSTM encoders for predicting relations and concepts end up being distinct. We use this multi-pass approach in our experiments. (Using the vanilla one-pass model from equation (2) results in a 1.4% drop in Smatch score.)
2.5 Alignment model
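Recall that the alignment model is only used at training, and hence it can rely both on the input (states $\mathbf{h}_1, \ldots, \mathbf{h}_n$) and on the list of concepts $\mathbf{c}$.
Formally, we add $(n - m)$ NULL concepts to the list. (After recategorization (Section 3.1), $m \leq n$ holds for most cases. For exceptions, we append NULL to the sentence.) Aligning a word to any NULL would correspond to saying that the word is not aligned to any ‘real’ concept. Note that each one-to-one alignment (i.e. permutation) between $n$ such concepts and $n$ words implies a valid injective alignment of $m$ ‘real’ concepts to words. This reduction to permutations will come in handy when we turn to the Gumbel-Sinkhorn relaxation in the next section. Given this reduction, from now on, we will assume that $m = n$.
As with sentences, we use a BiLSTM model to encode concepts $\mathbf{c}$ into states $\mathbf{g}_1, \ldots, \mathbf{g}_n$, where $\mathbf{g}_i \in \mathbb{R}^{d_g}$, $i \in \{1, \ldots, n\}$. We use a globally normalized alignment model:
$$Q_\psi(\mathbf{a} | \mathbf{c}, R, \mathbf{w}) = \frac{\exp\left(\sum_{i=1}^{n} \varphi(\mathbf{g}_i, \mathbf{h}_{a_i})\right)}{Z_\psi(\mathbf{c}, \mathbf{w})}$$
where $Z_\psi(\mathbf{c}, \mathbf{w})$ is the intractable partition function and the terms $\varphi(\mathbf{g}_i, \mathbf{h}_{a_i})$ score each alignment link according to a bilinear form
$$\varphi(\mathbf{g}_i, \mathbf{h}_k) = \mathbf{g}_i^T B \mathbf{h}_k$$ (3)
where $B \in \mathbb{R}^{d_g \times d}$ is a parameter matrix.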
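The reduction to permutations and the bilinear link scores can be illustrated as follows. This is a minimal sketch with assumed helper names (`pad_with_null`, `alignment_scores`, `injective_alignment`); encodings are plain Python lists rather than BiLSTM outputs.

```python
def pad_with_null(concept_vectors, n_words, null_vector):
    """Append NULL concept encodings until there are as many concepts as words."""
    missing = n_words - len(concept_vectors)
    if missing < 0:
        raise ValueError("more concepts than words; pad the sentence instead")
    return concept_vectors + [null_vector] * missing

def alignment_scores(G, B, H):
    """Return the score matrix with entry [i][k] = g_i^T B h_k.

    G: concept encodings (one list per concept), H: word states, B: parameter matrix.
    """
    BH = [[sum(b * h for b, h in zip(row, h_k)) for row in B] for h_k in H]
    return [[sum(g * v for g, v in zip(g_i, bh_k)) for bh_k in BH] for g_i in G]

def injective_alignment(permutation, n_real):
    """Drop the links of padded NULL concepts from a full permutation."""
    return permutation[:n_real]
```

For example, a permutation `[1, 0]` over one real concept and one NULL concept reduces to the injective alignment `[1]`: the real concept is aligned to the second word and the first word stays unaligned.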
2.6 Estimating model with Gumbel-Sinkhorn
Recall that our learning objective (1) involves expectation under the alignment model. The partition function of the alignment model $Z_\psi(\mathbf{c}, \mathbf{w})$ is intractable, and it is tricky even to draw samples from the distribution. Luckily, the recently proposed relaxation Mena et al. (2018) lets us circumvent this issue. First, note that exact samples from a categorical distribution can be obtained using the perturb-and-max technique Papandreou and Yuille (2011). For our alignment model, it would correspond to adding independent noise to the score for every possible alignment and choosing the highest scoring one:
$$\mathbf{a}^\star = \arg\max_{\mathbf{a} \in \mathcal{P}} \sum_{i=1}^{n} \varphi(\mathbf{g}_i, \mathbf{h}_{a_i}) + \epsilon_{\mathbf{a}}$$ (4)
where $\mathcal{P}$ is the set of all permutations of $n$ elements, $\epsilon_{\mathbf{a}}$ is a noise drawn independently for each $\mathbf{a}$ from the fixed Gumbel distribution ($\mathcal{G}(0, 1)$). Unfortunately, this is also intractable, as there are $n!$ permutations. Instead, in perturb-and-max an approximate schema is used where noise is assumed factorizable. In other words, first noisy scores are computed as $\hat{\varphi}(\mathbf{g}_i, \mathbf{h}_{a_i}) = \varphi(\mathbf{g}_i, \mathbf{h}_{a_i}) + \epsilon_{i, a_i}$, where $\epsilon_{i, a_i} \sim \mathcal{G}(0, 1)$, and an approximate sample is obtained by
$$\mathbf{a}^\star = \arg\max_{\mathbf{a} \in \mathcal{P}} \sum_{i=1}^{n} \hat{\varphi}(\mathbf{g}_i, \mathbf{h}_{a_i})$$
Such a sampling procedure is still intractable in our case and also non-differentiable. The main contribution of Mena et al. (2018) is approximating this with a simple differentiable computation $\hat{A} = S_t(\Phi, \Sigma)$ which yields an approximate (i.e. relaxed) permutation. We use $\Phi$ and $\Sigma$ to denote the matrices of alignment scores $\varphi(\mathbf{g}_i, \mathbf{h}_k)$ and noise variables $\epsilon_{ik}$, respectively. Instead of returning an index $a_i$ for every concept $i$, it would return a (peaky) distribution over words $\hat{\mathbf{a}}_i$. The peakiness is controlled by the temperature parameter $t$ of Gumbel-Sinkhorn, which balances smoothness (‘differentiability’) vs. bias of the estimator. For further details and the derivation, we refer the reader to the original paper Mena et al. (2018).
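To make the relaxation more concrete, here is a small stdlib-only sketch of a Gumbel-Sinkhorn style operator: Gumbel noise is added to the score matrix, the result is divided by the temperature, and rows and columns are normalized alternately in log space. It follows the general recipe of Mena et al. (2018) as we understand it and is not the released code; the function names, the fixed iteration count and the default seed are assumptions.

```python
import math
import random

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def sinkhorn(log_scores, n_iters=20):
    """Alternate row and column normalization in log space.

    Columns of the result sum to 1; rows sum to 1 only approximately,
    because the number of iterations is finite.
    """
    n = len(log_scores)
    log_a = [list(row) for row in log_scores]
    for _ in range(n_iters):
        rows = []
        for row in log_a:
            lse = logsumexp(row)
            rows.append([v - lse for v in row])
        col_lse = [logsumexp([rows[i][k] for i in range(n)]) for k in range(n)]
        log_a = [[rows[i][k] - col_lse[k] for k in range(n)] for i in range(n)]
    return [[math.exp(v) for v in row] for row in log_a]

def gumbel_sinkhorn(phi, temperature=1.0, n_iters=20, rng=None):
    """Relaxed permutation sample from a square score matrix phi."""
    rng = rng or random.Random(0)
    noisy = [[(score - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / temperature
              for score in row] for row in phi]
    return sinkhorn(noisy, n_iters)
```

Lower temperatures make each row closer to a one-hot vector, at the cost of sharper (harder to optimize) gradients.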
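Note that $\Phi$ is a function of the alignment model $Q_\psi$, so we will write $\Phi_\psi$ in what follows. The variational bound (1) can now be approximated as
$$E_{\Sigma \sim \mathcal{G}(0,1)}\left[\log P_\theta(\mathbf{c} | S_t(\Phi_\psi, \Sigma), \mathbf{w}) + \log P_\phi(R | S_t(\Phi_\psi, \Sigma), \mathbf{w}, \mathbf{c})\right] - D_{KL}\left(\frac{\Phi_\psi + \Sigma}{t} \,\Big\|\, \frac{\Sigma}{t_0}\right)$$ (5)
Following Mena et al. (2018), the original KL term from equation (1) is approximated by the KL term between two $n \times n$ matrices of i.i.d. Gumbel distributions with different temperature and mean. The parameter $t_0$ is the ‘prior temperature’.
Using the Gumbel-Sinkhorn construction unfortunately does not guarantee that $\sum_i \hat{A}_{ij} = 1$. To encourage this equality to hold, and equivalently to discourage overlapping alignments, we add another regularizer to the objective (5):
$$\Omega(\hat{A}, \lambda) = \lambda \cdot \sum_{j} \max\left(\sum_{i} \hat{A}_{ij} - 1, 0\right)$$ (6)
Our final objective is fully differentiable with respect to all parameters (i.e. $\theta$, $\phi$ and $\psi$) and has low variance as sampling is performed from the fixed non-parameterized distribution, as in standard VAEs.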
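One plausible form of such an overlap regularizer is a hinge penalty on the total alignment mass each word receives. The sketch below is an assumption made for illustration (the function name `overlap_penalty` and the `weight` argument are hypothetical), not a transcription of the authors' code.

```python
def overlap_penalty(soft_alignment, weight=1.0):
    """Penalize words whose total alignment mass exceeds 1.

    soft_alignment: rows are concepts, columns are words (relaxed alignment).
    Returns weight * sum_k max(sum_i A[i][k] - 1, 0).
    """
    n_words = len(soft_alignment[0])
    col_mass = [sum(row[k] for row in soft_alignment) for k in range(n_words)]
    return weight * sum(max(mass - 1.0, 0.0) for mass in col_mass)
```

A hard injective alignment incurs no penalty, whereas two concepts that both put most of their mass on the same word are penalized in proportion to the excess.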
2.7 Relaxing concept and relation identification
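One remaining question is how to use the soft input $\hat{A} = S_t(\Phi_\psi, \Sigma)$ in the concept and relation identification models in equation (5). In other words, we need to define how we compute $P_\theta(\mathbf{c} | S_t(\Phi_\psi, \Sigma), \mathbf{w})$ and $P_\phi(R | S_t(\Phi_\psi, \Sigma), \mathbf{w}, \mathbf{c})$.
The standard technique would be to pass to the models expectations under the relaxed variables $\sum_{k=1}^{n} \hat{a}_{ik} \mathbf{h}_k$, instead of the vectors $\mathbf{h}_{a_i}$ Maddison et al. (2017); Jang et al. (2017). This is what we do for the relation identification model. We use this approach also to relax the one-hot encoding of the predicate position ($p$, see Section 2.4).
However, the concept prediction model relies on the pointing mechanism, i.e. directly exploits the words $\mathbf{w}$ rather than relying only on BiLSTM states $\mathbf{h}_k$. So instead we treat $\hat{\mathbf{a}}_i$ as a prior in a hierarchical model:
$$\log P_\theta(c_i | \hat{\mathbf{a}}_i, \mathbf{w}) \approx \log \sum_{k=1}^{n} \hat{a}_{ik} P_\theta(c_i | a_i = k, \mathbf{w})$$ (7)
As we will show in our experiments, a softer version of the loss is even more effective:
$$\log \tilde{P}_\theta(c_i | \hat{\mathbf{a}}_i, \mathbf{w}) \approx \log \sum_{k=1}^{n} \left(\hat{a}_{ik} P_\theta(c_i | a_i = k, \mathbf{w})\right)^{\alpha}$$ (8)
where we set the parameter $\alpha = 0.5$. We believe that using this loss encourages the model to more actively explore the alignment space. Geometrically, the loss surface shaped as a ball in the 0.5-norm space would push the model away from the corners, thus encouraging exploration.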
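The difference between the hierarchical relaxation and its softer variant can be seen on a toy example. The sketch below assumes that the softer loss raises each mixture term to the power 0.5 before summing; this reading, and the function names, are assumptions made for the illustration.

```python
import math

def hierarchical_log_prob(soft_align_row, concept_probs):
    """log sum_k a_k * p_k: the relaxed alignment acts as a mixture prior."""
    return math.log(sum(a * p for a, p in zip(soft_align_row, concept_probs)))

def soft_log_prob(soft_align_row, concept_probs, alpha=0.5):
    """log sum_k (a_k * p_k) ** alpha: softer variant (alpha = 0.5 assumed)."""
    return math.log(sum((a * p) ** alpha for a, p in zip(soft_align_row, concept_probs)))
```

With a uniform relaxed alignment over two words and concept probabilities 0.8 and 0.2, the softer variant weighs the low-probability word relatively more than the plain mixture does, which is one way to see why it may favor exploration of alternative alignments.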
3 Pre- and post-processing
3.1 Recategorization
AMR parsers often rely on a preprocessing stage, where specific subgraphs of AMR are grouped together and assigned to a single node with a new compound category (e.g., Werling et al. (2015); Foland and Martin (2017); Peng et al. (2017)); this transformation is reversed at the postprocessing stage. Our approach is very similar to the Factored Concept Label system of Wang and Xue (2017), with one important difference: we unpack our concepts before the relation identification stage, so the relations are predicted between original concepts (all nodes in each group share the same alignment distributions to the RNN states). Intuitively, the goal is to ensure that concepts that are rarely lexically triggered (e.g., thing in Figure 3) get grouped together with lexically triggered nodes. Such ‘primary’ concepts get encoded in the category of the concept (for the set of categories, see also Section 2.3). In Figure 3, the recategorized concept thing(opinion) is produced from thing and opine-01. We use concept as the dummy category type. There are 8 templates in our system which extract recategorizations for fixed phrases (e.g. thing(opinion)), and a deterministic system for grouping lexically flexible, but structurally stable subgraphs (e.g., named entities, have-rel-role-91 and have-org-role-91 concepts).
Details of the recategorization procedure and other preprocessing are provided in the appendix.
3.2 Postprocessing
For postprocessing, we handle sense disambiguation and wikification, and ensure legitimacy of the produced AMR graph. For sense disambiguation we pick the most frequent sense for that particular concept (‘-01’, if unseen). For wikification we again look up the entity in the training set and default to “-”. There is certainly room for improvement in both stages. Our probability model predicts edges conditionally independently and thus cannot guarantee the connectivity of the AMR graph; there are also additional constraints which are useful to impose. We enforce three constraints: (1) specific concepts can have only one neighbor (e.g., ‘number’ and ‘string’; see appendix for details); (2) each predicate concept can have at most one argument for each relation $r$; (3) the graph should be connected. Constraint (1) is addressed by keeping only the highest scoring neighbor. In order to satisfy the last two constraints we use a simple greedy procedure. First, for each edge, we pick the highest scoring relation and edge (possibly NULL). If constraint (2) is violated, we simply keep the highest scoring edge among the duplicates and drop the rest. If the graph is not connected (i.e. constraint (3) is violated), we greedily choose edges linking the connected components until the graph gets connected (MSCG in Flanigan et al. (2014)).
Finally, we need to select a root node. Similarly to relation identification, for each candidate concept $c_i$, we concatenate its embedding with the corresponding LSTM state ($\mathbf{h}_{a_i}$) and use the resulting representations to compute scores for a softmax classifier over all the concepts.
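The greedy treatment of constraints (2) and (3) can be sketched as follows. This is a simplified illustration rather than the MSCG algorithm or the released code: the function name `postprocess`, the dictionary input format, and the rule that a connecting edge is skipped when its (head, relation) slot is already taken are assumptions; constraint (1) is not handled here.

```python
def postprocess(n_nodes, edge_scores):
    """Greedy sketch of constraints (2) and (3).

    edge_scores: {(head, dependent): {relation: score}}, including "NULL".
    Returns a sorted list of (head, relation, dependent) edges.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    predicted, fallback = [], []
    for (head, dep), scores in edge_scores.items():
        rel = max(scores, key=scores.get)  # best label, possibly NULL
        if rel != "NULL":
            predicted.append((scores[rel], head, rel, dep))
        non_null = {r: s for r, s in scores.items() if r != "NULL"}
        if non_null:
            best = max(non_null, key=non_null.get)
            fallback.append((non_null[best], head, best, dep))

    # Constraint (2): keep the highest scoring edge per (head, relation).
    kept = {}
    for _, head, rel, dep in sorted(predicted, reverse=True):
        kept.setdefault((head, rel), (head, rel, dep))
    for head, _, dep in kept.values():
        parent[find(head)] = find(dep)

    # Constraint (3): add the best remaining edges that link two components.
    for _, head, rel, dep in sorted(fallback, reverse=True):
        if find(head) != find(dep) and (head, rel) not in kept:
            kept[(head, rel)] = (head, rel, dep)
            parent[find(head)] = find(dep)
    return sorted(kept.values())
```

In the test case used to check this sketch, a duplicate ARG0 edge is dropped and a lower scoring ARG1 edge is added so that the third node is attached to the graph.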
4 Experiments and Discussion
4.1 Data and setting
We primarily focus on the most recent LDC2016E25 (R2) dataset, which consists of 36521, 1368 and 1371 sentences in training, development and testing sets, respectively. The earlier LDC2015E86 (R1) dataset has been used by much of the previous work. It contains 16833 training sentences, and the same sentences for development and testing as R2. (Annotation in R2 has also been slightly revised.)
We used the development set to perform model selection and hyperparameter tuning. The hyperparameters, as well as information about embeddings and preprocessing, are presented in the supplementary materials.
We used Adam Kingma and Ba (2014) to optimize the loss (5) and to train the root classifier. Our best model is trained fully jointly, and we do early stopping on the development set scores. Training takes approximately 6 hours on a single GeForce GTX 1080 Ti with an Intel Xeon CPU E5-2620 v4.
Model  Data  Smatch 
JAMR Flanigan et al. (2016)  R1  67.0 
AMREager Damonte et al. (2017)  R1  64.0 
CAMR Wang et al. (2016)  R1  66.5 
SEQ2SEQ + 20M Konstas et al. (2017)  R1  62.1 
MulBiLSTM Foland and Martin (2017)  R1  70.7 
Ours  R1  73.7 
NeuralPointer Buys and Blunsom (2017)  R2  61.9 
ChSeq van Noord and Bos (2017)  R2  64.0 
ChSeq + 100K van Noord and Bos (2017)  R2  71.0 
Ours  R2  74.4 ± 0.16 
4.2 Experiments and discussion
We start by comparing our parser to previous work (see Table 1). Our model substantially outperforms all the previous models on both datasets. Specifically, it achieves 74.4% Smatch score on LDC2016E25 (R2), which is an improvement of 3.4% over the character seq2seq model relying on silver data van Noord and Bos (2017). For LDC2015E86 (R1), we obtain 73.7% Smatch score, which is an improvement of 3.0% over the previous best model, the multi-BiLSTM parser of Foland and Martin (2017).
Models  A’17  C’16  J’16  Ch’17  Ours 
Dataset  R1  R1  R1  R2  R2 
Smatch  64  63  67  71  74.4 
Unlabeled  69  69  69  74  77.1 
No WSD  65  64  68  72  75.5 
Reentrancy  41  41  42  52  52.3 
Concepts  83  80  83  82  85.9 
NER  83  75  79  79  86.0 
Wiki  64  0  75  65  75.7 
Negations  48  18  45  62  58.4 
SRL  56  60  60  66  69.8 
In order to disentangle individual phenomena, we use the AMR-evaluation tools Damonte et al. (2017) and compare to systems which reported these scores (Table 2). We obtain the highest scores on most subtasks. The exception is negation detection. However, this is not too surprising as many negations are encoded with morphology, and character models, unlike our word-level model, are able to capture predictive morphological features (e.g., detect prefixes such as “un-” or “im-”).
Metric  R1 Pre-Align  R1  R2 Pre-Align  R2 mean 
Smatch  72.8  73.7  73.5  74.4 
Unlabeled  75.3  76.3  76.1  77.1 
No WSD  73.8  74.7  74.6  75.5 
Reentrancy  50.2  50.6  52.6  52.3 
Concepts  85.4  85.5  85.5  85.9 
NER  85.3  84.8  85.3  86.0 
Wiki  66.8  75.6  67.8  75.7 
Negations  56.0  57.2  56.6  58.4 
SRL  68.8  68.9  70.2  69.8 
Now, we turn to ablation tests (see Table 3). First, we would like to see if our latent alignment framework is beneficial. In order to test this, we create a baseline version of our system (‘pre-align’) which relies on the JAMR aligner Flanigan et al. (2014), rather than inducing alignments as latent variables. Recall that in our model we used training data and a lemmatizer to produce candidates for the concept prediction model (see Section 2.3, the copy function). In order to have a fair comparison, if a concept is not aligned after JAMR, we try to use our copy function to align it. If an alignment is not found, we make the alignment uniform across the unaligned words. In preliminary experiments, we considered alternative versions (e.g., dropping concepts unaligned by JAMR or dropping concepts unaligned after both JAMR and the matching heuristic), but the chosen strategy was the most effective. The scores of pre-align are superior to the results from Foland and Martin (2017), which also relies on JAMR alignments and uses BiLSTM encoders. There are many potential reasons for this difference in performance. For example, their relation identification model is different (e.g., single pass, no bi-affine modeling), they used much smaller networks than us, they use plain JAMR rather than a combination of JAMR and our copy function, and they use a different recategorization system. These results confirm that we started with a strong basic model, and that our variational alignment framework provided further gains in performance.
Ablation  Concepts  SRL  Smatch 

2 stages  85.6  68.9  73.6 
2 stages, tune align  85.6  69.2  73.9 
Full model  85.9  69.8  74.4 
Now we would like to confirm that joint training of alignments with both concepts and relations is beneficial. In other words, we would like to see if alignments need to be induced in such a way as to benefit the relation identification task. For this ablation we break the full joint training into two stages. We start by jointly training the alignment model and the concept identification model. Once these are trained, we optimize the relation model but keep the concept identification and alignment models fixed (‘2 stages’ in Table 4). When compared to our joint model (‘full model’), we observe a substantial drop in Smatch score (0.8%). In another version (‘2 stages, tune align’) we also use two stages but we fine-tune the alignment model in the second stage. This approach appears slightly more accurate but is still 0.5% below the full model. In both cases, the drop is more substantial for relations (‘SRL’). In order to see why relations are potentially useful in learning alignments, consider Figure 4. The example contains duplicate concepts long. The concept prediction model factorizes over concepts and does not care which way these duplicates are aligned: correctly (green edges) or not (red edges). Formally, the true posterior under the concept-only model in ‘2 stages’ assigns exactly the same probability to both configurations, and the alignment model will be forced to mimic it (even though it relies on an LSTM model of the graph). The spurious ambiguity will have a detrimental effect on the relation identification stage.
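The spurious ambiguity for duplicate concepts can be checked with a toy computation. The sketch below uses made-up probabilities and a hypothetical helper `concept_log_likelihood`; it only illustrates that a likelihood which factorizes over concepts cannot distinguish the two ways of aligning two identical concepts.

```python
import math

def concept_log_likelihood(concepts, alignment, word_concept_probs):
    """sum_i log P(c_i | word a_i); factorizes over concepts.

    word_concept_probs[k][c] is the (illustrative) probability of concept c at word k.
    """
    return sum(math.log(word_concept_probs[a][c])
               for c, a in zip(concepts, alignment))

# Two copies of the concept "long", two candidate words (values are made up).
probs = [{"long": 0.9}, {"long": 0.8}]
straight = concept_log_likelihood(["long", "long"], [0, 1], probs)
swapped = concept_log_likelihood(["long", "long"], [1, 0], probs)
```

Both alignments receive the same score here, so only a term that depends on the relations between concepts can break the tie.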
Ablation  Concepts  SRL  Smatch 

No Sinkhorn  85.7  69.3  73.8 
No Sinkhorn reg  85.6  69.5  74.2 
No soft loss  85.2  69.1  73.7 
Full model  85.9  69.8  74.4 
It is interesting to see the contribution of other modeling decisions we made when modeling and relaxing alignments. First, instead of using Gumbel-Sinkhorn, which encourages mutually repulsive alignments, we now use a factorized alignment model. Note that this model (‘No Sinkhorn’ in Table 5) still relies on (relaxed) discrete alignments (using Gumbel-softmax) but does not constrain the alignments to be injective. A substantial drop in performance indicates that the prior knowledge about the nature of alignments appears beneficial. Second, we remove the additional regularizer for the Gumbel-Sinkhorn approximation (equation (6)). The performance drop in Smatch score (‘No Sinkhorn reg’) is only moderate. Finally, we show that using the simple hierarchical relaxation (equation (7)) rather than our softer version of the loss (equation (8)) results in a substantial drop in performance (‘No soft loss’, 0.7% Smatch). We hypothesize that the softer relaxation favors exploration of alignments and helps to discover better configurations.
5 Additional Related Work
Alignment performance has been previously identified as a potential bottleneck affecting AMR parsing Damonte et al. (2017); Foland and Martin (2017). Some recent work has focused on building aligners specifically for training their parsers Werling et al. (2015); Wang and Xue (2017). However, those aligners are trained independently of concept and relation identification and only used at preprocessing.
Treating alignment as discrete variables has been successful in some sequence transduction tasks with neural models Yu et al. (2017, 2016). Our work is similar in that we also train discrete alignments jointly but the tasks, the inference framework and the decoders are very different.
The discrete alignment modeling framework has been developed in the context of traditional (i.e. non-neural) statistical machine translation Brown et al. (1993). Such translation models have also been successfully applied to semantic parsing tasks (e.g., Andreas et al. (2013)), where they rivaled specialized semantic parsers from that period. However, they are considerably less accurate than current state-of-the-art parsers applied to the same datasets (e.g., Dong and Lapata (2016)).
For AMR parsing, another way to avoid using pre-trained aligners is to use seq2seq models Konstas et al. (2017); van Noord and Bos (2017). In particular, van Noord and Bos (2017) used a character-level seq2seq model and achieved the previous state-of-the-art result. However, their model is very data-demanding as they needed to train it on an additional 100K sentences parsed by other parsers. This may be due to two reasons. First, seq2seq models are often not as strong on smaller datasets. Second, recurrent decoders may struggle with predicting the linearized AMRs, as many statistical dependencies are highly non-local.
6 Conclusions
We introduced a neural AMR parser trained by jointly modeling alignments, concepts and relations. We make such joint modeling computationally feasible by using the variational autoencoding framework and continuous relaxations. The parser achieves state-of-the-art results and ablation tests show that joint modeling is indeed beneficial.
We believe that the proposed approach may be extended to other parsing tasks where alignments are latent (e.g., parsing to logical form Liang (2016)). Another promising direction is integrating character seq2seq to substitute the copy function. This should also improve the handling of negation and rare words. Though our parsing model does not use any linearization of the graph, we relied on LSTMs and a somewhat arbitrary linearization (depth-first traversal) to encode the AMR graph in our alignment model. A better alternative would be to use graph convolutional networks Marcheggiani and Titov (2017); Kipf and Welling (2017): neighborhoods in the graph are likely to be more informative for predicting alignments than the neighborhoods in the graph traversal.
Acknowledgments
We thank Marco Damonte, Shay Cohen, Diego Marcheggiani and Wilker Aziz for helpful discussions as well as anonymous reviewers for their suggestions. The project was supported by the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518).
References
 Andreas et al. (2013) Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 47–52.
 Artzi et al. (2015) Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1699–1710.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.
 Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking.
 Bornschein and Bengio (2015) Jörg Bornschein and Yoshua Bengio. 2015. Reweighted wake-sleep. International Conference on Learning Representations.
 Brown et al. (1993) Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist., 19(2):263–311.
 Buys and Blunsom (2017) Jan Buys and Phil Blunsom. 2017. Oxford at SemEval-2017 Task 9: Neural AMR parsing with pointer-augmented attention. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 914–919. Association for Computational Linguistics.
 Damonte et al. (2017) Marco Damonte, Shay B Cohen, and Giorgio Satta. 2017. An Incremental Parser for Abstract Meaning Representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 536–546.
 Dohare and Karnick (2017) Shibhansh Dohare and Harish Karnick. 2017. Text Summarization using Abstract Meaning Representation. arXiv preprint arXiv:1706.01678.
 Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 33–43.
 Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. International Conference on Learning Representations.
 Flanigan et al. (2016) Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime Carbonell. 2016. CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1202–1206. Association for Computational Linguistics.
 Flanigan et al. (2014) Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A Discriminative Graph-Based Parser for the Abstract Meaning Representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Baltimore, Maryland. Association for Computational Linguistics.
 Foland and Martin (2017) William Foland and James H. Martin. 2017. Abstract Meaning Representation Parsing using LSTM Recurrent Neural Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–472, Vancouver, Canada. Association for Computational Linguistics.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations.
 Jones et al. (2012) Bevan K. Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-Based Machine Translation with Hyperedge Replacement Grammars. In COLING.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. International Conference on Learning Representations.
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations.
 Konstas et al. (2017) Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-Sequence Models for Parsing and Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157, Vancouver, Canada. Association for Computational Linguistics.
 Liang (2016) Percy Liang. 2016. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9):68–76.
 Liu et al. (2015) Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman M. Sadeh, and Noah A. Smith. 2015. Toward Abstractive Summarization Using Semantic Representations. In HLT-NAACL.
 Loper and Bird (2002) Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Luong et al. (2015) Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

 Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations.
 Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
 Marcheggiani et al. (2017) Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 411–420, Vancouver, Canada. Association for Computational Linguistics.
 Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1507–1516, Copenhagen, Denmark. Association for Computational Linguistics.
 Mena et al. (2018) Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. 2018. Learning Latent Permutations with Gumbel-Sinkhorn Networks. International Conference on Learning Representations.

 Mitra and Baral (2016) Arindam Mitra and Chitta Baral. 2016. Addressing a question answering challenge by combining statistical methods with inductive rule learning and reasoning. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016. AAAI Press.
 Mnih and Gregor (2014) Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. In Proceedings of the International Conference on Machine Learning.
 van Noord and Bos (2017) Rik van Noord and Johan Bos. 2017. Neural Semantic Parsing by Character-based Translation: Experiments with Abstract Meaning Representations. Computational Linguistics in the Netherlands Journal, 7:93–108.
 Papandreou and Yuille (2011) George Papandreou and Alan L Yuille. 2011. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 193–200. IEEE.

 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
 Peng et al. (2017) Xiaochang Peng, Chuan Wang, Daniel Gildea, and Nianwen Xue. 2017. Addressing the Data Sparsity Issue in Neural AMR Parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 366–375. Association for Computational Linguistics.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
 Pourdamghani et al. (2014) Nima Pourdamghani, Yang Gao, Ulf Hermjakob, and Kevin Knight. 2014. Aligning English strings with abstract meaning representation graphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 425–429.
 Schuster and Paliwal (1997) M. Schuster and K.K. Paliwal. 1997. Bidirectional Recurrent Neural Networks. Trans. Sig. Proc., 45(11):2673–2681.
 Wang et al. (2016) Chuan Wang, Sameer Pradhan, Xiaoman Pan, Heng Ji, and Nianwen Xue. 2016. CAMR at SemEval-2016 Task 8: An Extended Transition-based AMR Parser. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1173–1178, San Diego, California. Association for Computational Linguistics.
 Wang and Xue (2017) Chuan Wang and Nianwen Xue. 2017. Getting the Most out of AMR Parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1257–1268.
 Werling et al. (2015) Keenon Werling, Gabor Angeli, and Christopher D. Manning. 2015. Robust Subgraph Generation Improves Abstract Meaning Representation Parsing. In ACL.
 Yu et al. (2017) Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. 2017. The Neural Noisy Channel. In International Conference on Learning Representations.
 Yu et al. (2016) Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online Segment to Segment Neural Transduction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1307–1316. Association for Computational Linguistics.

 Zhou and Xu (2015) Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1127–1137.
Supplementary Material
7 Matching algorithm for copying concepts
Only frequent concepts (frequency at least 10 for R2 and 5 for R1) can be generated without the copying mechanism (i.e., they have their own vector associated with them). Both frequent and infrequent ones are processed with copying, using candidates produced by the algorithm below and the matching rules in Table 6.
Rules  Matching Criteria
Verbalization Match  exact match with a frame in "verbalization-list-v1.06.txt"
PropBank Match  exact match with a frame in the PropBank frame files
Suffix Removal Match  word with suffix ("ed", "ly", "ing") removed is identical to the concept lemma
Edit-distance Match  edit distance smaller than 50% of the length
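As a rough illustration of the last two rules in the table, the sketch below implements suffix removal and edit-distance matching in plain Python. The function names are hypothetical, and the table does not say which length the 50% threshold refers to; here it is assumed to be the length of the word. The verbalization and PropBank lookups are omitted because they depend on external resource files.

```python
SUFFIXES = ("ed", "ly", "ing")  # suffixes listed in the matching table


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def suffix_removal_match(word, concept_lemma):
    """True if stripping one listed suffix from the word yields the lemma."""
    return any(word.endswith(s) and word[:-len(s)] == concept_lemma
               for s in SUFFIXES)


def edit_distance_match(word, concept_lemma):
    """True if the edit distance is below 50% of the word length.

    ASSUMPTION: "the length" is taken to be len(word).
    """
    return edit_distance(word, concept_lemma) < 0.5 * len(word)
```

For example, `suffix_removal_match("walked", "walk")` holds, while `"running"` does not reduce to `"run"` by suffix removal alone and would have to be caught by one of the other rules.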
8 Recategorization details
Recategorization is handled with rules listed in Table 2. They are triggered if a given primary concept (‘primary’) appears adjacent to edges labeled with relations given in column ‘rel’. The assigned category is shown in column ‘recategorized’. The rules yield 32 categories when applied to the training set.
primary  rel  recategorized
person  ARG0-of/ARG1-of  person([second])
thing  ARG0-of/ARG1-of/ARG2-of  thing([second])
most  degree-of  most([second])
quantity  unit  primary([second])
date-entity  weekday/dayperiod/season  date-entity([second])
monetary-quantity  unit/ARG2-of/ARG1-of/quant  monetary-quantity([second])
temporal-quantity  unit/ARG3-of  temporal-quantity([second])
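The rule format above can be sketched as a small lookup. This is only an illustration under stated assumptions: relation names are written in their standard hyphenated AMR form, `[second]` is assumed to stand for the concept at the other end of the triggering edge, the first triggering edge is assumed to win, and the `quantity` row is left out because its output pattern differs from the others. The names `RULES` and `recategorize` are hypothetical.

```python
# primary concept -> relations that trigger recategorization (from the table)
RULES = {
    "person": {"ARG0-of", "ARG1-of"},
    "thing": {"ARG0-of", "ARG1-of", "ARG2-of"},
    "most": {"degree-of"},
    "date-entity": {"weekday", "dayperiod", "season"},
    "monetary-quantity": {"unit", "ARG2-of", "ARG1-of", "quant"},
    "temporal-quantity": {"unit", "ARG3-of"},
}


def recategorize(primary, edges):
    """Return the recategorized label, or None if no rule fires.

    edges: list of (relation, neighbor_concept) pairs adjacent to `primary`.
    ASSUMPTION: "[second]" is the neighbor concept on the triggering edge,
    and the first matching edge is used.
    """
    triggers = RULES.get(primary, set())
    for relation, second in edges:
        if relation in triggers:
            return f"{primary}({second})"
    return None
```

Under these assumptions, a `person` node with an `ARG0-of` edge to `teach-01` would be mapped to `person(teach-01)`, while a `person` node with only a `mod` edge is left unchanged.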
There are also rules of another type shown in Table 3 below. The templates and examples are in column ‘original’, the resulting concepts are in column ‘recategorized’. These rules yield 109 additional types when applied to the training set.
original  recategorized
(BNer_type(n1),…,Ner_type(nx))
BNer_city(New),Ner_city(York)
have-x-role_type(role)
have-org-role_person(premier)
x-entity(constant)
ordinal-entity(1)
9 Additional preprocessing
Besides constructing recategorized AMR concepts, we perform additional preprocessing. We start with the tokenized dataset of Pourdamghani et al. (2014). We take all dashed AMR concepts (e.g., make-up and more-than) and concatenate the corresponding spans (based on statistics from the training set and PropBank frame files). We also combine spans of words corresponding to a single number. For relation identification, we normalize relations to one canonical direction (e.g. arg0, time-of). For named entity recognition and lemmatization, we use the Stanford CoreNLP toolkit Manning et al. (2014). For pretrained embeddings, we used GloVe (300-dimensional embeddings) Pennington et al. (2014).
10 Model parameters and optimization details
We selected hyperparameters based on the best performance on the development set. For all the ablation tests, the hyperparameters are fixed. We used two different BiLSTM encoders with the same hyperparameters to encode the sentence for concept identification and alignment prediction, another BiLSTM to encode the AMR concept sequence for alignment, and finally two different BiLSTMs with the same hyperparameters to encode the sentence for relation identification and root identification. There are 5 BiLSTM encoders in total. Hyperparameters for the model are summarized in Table 9, and optimization parameters are summarized in Table 10.
Model components  Hyperparameters
GloVe Embeddings  300
Lemma Embeddings  200
POS Embeddings  32
NER Embeddings  16
Category Embeddings  32
Concept/Alignment Sentence BiLSTM  1 layer, 548 input, 256 hidden (each direction)
AMR Categories  32
AMR Lemmas  506
AMR NER types  109
Alignment AMR BiLSTM  1 layer, 232 input, 100 hidden (each direction)
bilinear align  200 × 512
Relation map dimensionality  200
Relation/Root Sentence BiLSTM  2 layers, 549 input (predicate position), 256 hidden (each direction)
relation vector  200
lemma vector  512
root vector  200
Sinkhorn temperature  1
Sinkhorn prior temperature  5
Sinkhorn steps l for full joint training  10
Sinkhorn steps l for two stages training  5
10  
Dropout  .2 
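The input sizes in the table are consistent with simple concatenation of the listed embeddings. The check below is a reading of the table rather than something stated in the text: it assumes the sentence encoders concatenate GloVe, lemma, POS and NER embeddings, that the relation/root encoder adds one predicate-position feature, that the AMR encoder concatenates lemma and category embeddings, and that the 200 and 512 in the bilinear alignment row are the concatenated forward and backward BiLSTM states.

```python
# Embedding sizes from the hyperparameter table
glove, lemma, pos, ner, category = 300, 200, 32, 16, 32

# ASSUMPTION: inputs are formed by concatenating embeddings
sentence_input = glove + lemma + pos + ner   # concept/alignment sentence BiLSTM
relation_input = sentence_input + 1          # plus a predicate-position feature
amr_input = lemma + category                 # AMR BiLSTM used for alignment

# ASSUMPTION: BiLSTM outputs concatenate both directions
sentence_state = 2 * 256
amr_state = 2 * 100

print(sentence_input, relation_input, amr_input, amr_state, sentence_state)
```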
Optimizer Parameters  Values
Batch size for single stage  64
Maximum Epochs  30
Batch size for first stage  512
Batch size for second stage  64
Maximum Epochs for both stages  30
Learning Rate  1e-4
Adam betas  (0.9, 0.999)
Adam eps  1e-8
Weight decay  1e-5
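For convenience, the optimization settings can be collected into a small configuration, sketched below. The learning rate, eps and weight decay are read as 1e-4, 1e-8 and 1e-5. The dictionary names and the schedule keys are hypothetical; the optimizer keys are chosen to match the keyword arguments of `torch.optim.Adam`, and applying weight decay through that argument is an assumption, since the text does not say how it was implemented.

```python
# Optimizer settings from the table; key names match torch.optim.Adam kwargs.
ADAM_KWARGS = {
    "lr": 1e-4,
    "betas": (0.9, 0.999),
    "eps": 1e-8,
    "weight_decay": 1e-5,  # ASSUMPTION: passed to Adam rather than applied separately
}

# Hypothetical layout for the two training regimes listed in the table.
TRAINING = {
    "single_stage": {"batch_size": 64, "max_epochs": 30},
    "two_stage": {
        "first_stage_batch_size": 512,
        "second_stage_batch_size": 64,
        "max_epochs": 30,
    },
}

# With PyTorch available, one possible use would be:
#   optimizer = torch.optim.Adam(model.parameters(), **ADAM_KWARGS)
```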