AMR Parsing as Sequence-to-Graph Transduction

05/21/2019 ∙ by Sheng Zhang, et al.

We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).

1 Introduction

Abstract Meaning Representation (AMR, Banarescu et al., 2013) parsing is the task of transducing natural language text into AMR, a graph-based formalism used for capturing sentence-level semantics. Challenges in AMR parsing include: (1) its property of reentrancy – the same concept can participate in multiple relations – which leads to graphs in contrast to trees (Wang et al., 2015); (2) the lack of gold alignments between nodes (concepts) in the graph and words in the text which limits attempts to rely on explicit alignments to generate training data (Flanigan et al., 2014; Wang et al., 2015; Damonte et al., 2017; Foland and Martin, 2017; Peng et al., 2017b; Groschwitz et al., 2018; Guo and Lu, 2018); and (3) relatively limited amounts of labeled data (Konstas et al., 2017).

Figure 1: Two views of reentrancy in AMR for an example sentence “The victim could help himself.” (a) A standard AMR graph. (b) An AMR tree with node indices as an extra layer of annotation, where the corresponding graph can be recovered by merging nodes of the same index.

Recent attempts to overcome these challenges include: modeling alignments as latent variables (Lyu and Titov, 2018); leveraging external semantic resources (Artzi et al., 2015; Bjerva et al., 2016); data augmentation (Konstas et al., 2017; van Noord and Bos, 2017); and employing attention-based sequence-to-sequence models Barzdins and Gosko (2016); Konstas et al. (2017); van Noord and Bos (2017).

In this paper, we introduce a different way to handle reentrancy, and propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. The proposed model, supported by an extended pointer-generator network, is aligner-free and can be effectively trained with limited amounts of labeled AMR data. Experiments on two publicly available AMR benchmarks demonstrate that our parser clearly outperforms the previous best parsers, achieving the best reported Smatch scores: 76.3% F1 on LDC2017T10 and 70.2% F1 on LDC2014T12. We also provide extensive ablative and qualitative studies, quantifying the contributions from each component.

2 Another View of Reentrancy

AMR is a rooted, directed, and usually acyclic graph where nodes represent concepts, and labeled directed edges represent the relationships between them (see Figure 1 for an example). The reason AMR is a graph instead of a tree is that it allows reentrant semantic relations. For instance, in Figure 1(a) “victim” is both ARG0 and ARG1 of “help-01”. While efforts have gone into developing graph-based algorithms for AMR parsing (Chiang et al., 2013; Flanigan et al., 2014), parsing a sentence into a graph is more challenging than parsing it into a tree, since efficient off-the-shelf tree-based algorithms exist, e.g., Chu and Liu (1965); Edmonds (1968). To leverage these tree-based algorithms as well as other structured prediction paradigms McDonald et al. (2005), we introduce another view of reentrancy.

AMR reentrancy is employed when a node participates in multiple semantic relations. We convert an AMR graph into a tree by duplicating nodes that have reentrant relations; that is, whenever a node has a reentrant relation, we make a copy of that node and use the copy to participate in the relation, thereby resulting in a tree. Next, in order to preserve the reentrancy information, we add an extra layer of annotation by assigning an index to each node. Duplicated nodes are assigned the same index as the original node. Figure 1(b) shows a resultant AMR tree: subscripts of nodes are indices; two “victim” nodes have the same index as they refer to the same concept. The original AMR graph can be recovered by merging identically indexed nodes and unioning edges from/to these nodes. A similar idea was used by Artzi et al. (2015) who introduced Skolem IDs to represent anaphoric references in the transformation from CCG to AMR.
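A minimal sketch of this conversion, assuming an illustrative graph interface with concept(node_id) and outgoing(node_id) accessors (these helpers and the TreeNode class are not the paper's actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    concept: str                                   # e.g., "victim"
    index: int                                     # copies share the index of the original node
    children: list = field(default_factory=list)   # (relation, TreeNode) pairs

def graph_to_indexed_tree(graph, root_id):
    """Duplicate reentrant nodes so the structure becomes a tree,
    recording a shared index for every copy of the same graph node."""
    indices = {}   # graph node id -> assigned index

    def build(node_id):
        if node_id in indices:
            # Reentrant relation: emit a copy carrying the same index,
            # without expanding its subtree again.
            return TreeNode(graph.concept(node_id), indices[node_id])
        indices[node_id] = len(indices)
        node = TreeNode(graph.concept(node_id), indices[node_id])
        for relation, child_id in graph.outgoing(node_id):
            node.children.append((relation, build(child_id)))
        return node

    return build(root_id)
```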

3 Task Formalization

If we consider the AMR tree with indexed nodes as the prediction target, then our approach to parsing is formalized as a two-stage process: node prediction and edge prediction.[1] An example of the parsing process is shown in Figure 2.

[1] The two-stage process is similar to first “concept identification” and then “relation identification” in Flanigan et al. (2014); Zhou et al. (2016); Lyu and Titov (2018), inter alia.

Figure 2: A two-stage process of AMR parsing. We remove senses (i.e., -01, -02, etc.) as they will be assigned in the post-processing step.

Node Prediction Given an input sentence $w = \langle w_1, \ldots, w_n \rangle$, each $w_i$ a word in the sentence, our approach sequentially decodes a list of nodes $u = \langle u_1, \ldots, u_m \rangle$ and deterministically assigns their indices $d = \langle d_1, \ldots, d_m \rangle$.

Note that we allow the same node to occur multiple times in the list; multiple occurrences of a node will be assigned the same index. We choose to predict nodes sequentially rather than simultaneously, because (1) we believe the current node generation is informative for future node generation; (2) variants of efficient sequence-to-sequence models Bahdanau et al. (2014); Vinyals et al. (2015) can be employed to model this process. At training time, we obtain the reference list of nodes and their indices using a pre-order traversal over the reference AMR tree. We also evaluate other traversal strategies, and discuss their differences in Section 7.2.

Figure 3: Extended pointer-generator network for node prediction. For each decoding time step, three probabilities $p_{gen}$, $p_{src}$, and $p_{tgt}$ are calculated. The source and target attention distributions as well as the vocabulary distribution are weighted by these probabilities respectively, and then summed to obtain the final distribution, from which we make our prediction. Best viewed in color.

Edge Prediction Given an input sentence $w$, a node list $u$, and indices $d$, we look for the highest scoring parse tree $T$ in the space $\mathcal{T}(u)$ of valid trees over $u$ with the constraint of $d$. A parse tree $T$ is a set of directed head-modifier edges. In order to make the search tractable, we follow the arc-factored graph-based approach McDonald et al. (2005); Kiperwasser and Goldberg (2016), decomposing the score of a tree into the sum of the scores of its head-modifier edges:

$\mathrm{score}(T) = \sum_{(u_i, u_j) \in T} \mathrm{score}(u_i, u_j)$

Based on the scores of the edges, the highest scoring parse tree (i.e., maximum spanning arborescence) can be efficiently found using the Chu-Liu-Edmonds algorithm. We further incorporate indices as constraints in the algorithm, which is described in Section 4.4. After obtaining the parse tree, we merge identically indexed nodes to recover the standard AMR graph.
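A minimal sketch of this merge step, assuming simple list-based inputs (illustrative, not the paper's implementation):

```python
def merge_indexed_tree(concepts, indices, edges):
    """Merge identically indexed tree positions back into single graph nodes.

    concepts: list of node concepts, one per tree position
    indices:  list of node indices, one per tree position
    edges:    list of (head_position, modifier_position, label) tuples
    """
    # Every distinct index becomes one graph node.
    graph_nodes = {}
    for concept, idx in zip(concepts, indices):
        graph_nodes.setdefault(idx, concept)

    # Union the edges of merged nodes, dropping exact duplicates.
    graph_edges = {(indices[h], indices[m], label) for h, m, label in edges}
    return graph_nodes, graph_edges
```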

4 Model

Our model has two main modules: (1) an extended pointer-generator network for node prediction; and (2) a deep biaffine classifier for edge prediction. The two modules correspond to the two-stage process for AMR parsing, and they are jointly learned during training.

4.1 Extended Pointer-Generator Network

Inspired by the self-copy mechanism in Zhang et al. (2018), we extend the pointer-generator network See et al. (2017) for node prediction. The pointer-generator network was proposed for text summarization; it can copy words from the source text via pointing, while retaining the ability to produce novel words through the generator. The major difference of our extension is that it can copy nodes, not only from the source text, but also from the previously generated nodes on the target side. This target-side pointing is well-suited to our task, as the nodes we predict can be copies of other nodes. While there are other pointer/copy networks Gulcehre et al. (2016); Merity et al. (2016); Gu et al. (2016); Miao and Blunsom (2016); Nallapati et al. (2016), we found the pointer-generator network very effective at reducing data sparsity in AMR parsing, which will be shown in Section 7.2.

As depicted in Figure 3, the extended pointer-generator network consists of four major components: an encoder embedding layer, an encoder, a decoder embedding layer, and a decoder.

Encoder Embedding Layer This layer converts words in input sentences into vector representations. Each vector is the concatenation of embeddings of GloVe Pennington et al. (2014), BERT Devlin et al. (2018), POS (part-of-speech) tags and anonymization indicators, and features learned by a character-level convolutional neural network (CharCNN, Kim et al., 2016).

Anonymization indicators are binary indicators that tell the encoder whether the word is an anonymized word. In preprocessing, text spans of named entities in input sentences will be replaced by anonymized tokens (e.g. person, country) to reduce sparsity (see the Appendix for details).

Except for BERT, all other embeddings are fetched from their corresponding learned embedding look-up tables. As BERT adopts WordPiece tokenization Schuster and Nakajima (2012); Wu et al. (2016), one word may correspond to multiple hidden states of BERT. In order to accurately use these hidden states to represent each word, we apply an average pooling function to the outputs of BERT.[2] Figure 4 illustrates the process of generating word-level embeddings from BERT.

[2] We compare average pooling with max pooling on the test data in the Appendix.
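A minimal sketch of the pooling step, assuming the last-layer BERT states and a wordpiece-to-word alignment are already available (names and shapes are illustrative):

```python
import torch

def wordpiece_to_word_embeddings(hidden_states, wordpiece_to_word):
    """Average-pool BERT wordpiece states into one vector per word.

    hidden_states:     (num_wordpieces, hidden_dim) last-layer BERT outputs
    wordpiece_to_word: list mapping each wordpiece position to its word id
    """
    num_words = max(wordpiece_to_word) + 1
    word_embeds = []
    for w in range(num_words):
        piece_ids = [i for i, wid in enumerate(wordpiece_to_word) if wid == w]
        word_embeds.append(hidden_states[piece_ids].mean(dim=0))
    return torch.stack(word_embeds)  # (num_words, hidden_dim)
```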

Figure 4: Word-level embeddings from BERT.

Encoder The encoder is a multi-layer bidirectional RNN Schuster and Paliwal (1997):

$h_i^l = \big[\overrightarrow{\mathrm{LSTM}}^l(h_i^{l-1}, \overrightarrow{h}_{i-1}^{l});\ \overleftarrow{\mathrm{LSTM}}^l(h_i^{l-1}, \overleftarrow{h}_{i+1}^{l})\big]$

where $\overrightarrow{\mathrm{LSTM}}^l$ and $\overleftarrow{\mathrm{LSTM}}^l$ are two LSTM cells Hochreiter and Schmidhuber (1997); $h_i^l$ is the $l$-th layer encoder hidden state at time step $i$; $h_i^0$ is the encoder embedding layer output for word $w_i$.

Decoder Embedding Layer Similar to the encoder embedding layer, this layer outputs vector representations for AMR nodes. The difference is that each vector is the concatenation of embeddings of GloVe, POS tags and indices, and feature vectors from CharCNN.

POS tags of nodes are inferred at runtime: if a node is a copy from the input sentence, the POS tag of the corresponding word is used; if a node is a copy from the preceding nodes, the POS tag of its antecedent is used; if a node is a new node emitted from the vocabulary, an UNK tag is used.
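A small sketch of these runtime rules, with an assumed node representation (the attribute names are illustrative, not the model's actual bookkeeping):

```python
def infer_node_pos_tag(node, source_pos_tags, node_pos_tags):
    """Infer a POS tag for a decoded node at runtime.

    source_pos_tags: POS tags of the input words
    node_pos_tags:   POS tags already assigned to previously decoded nodes
    """
    if node.source_copy_position is not None:        # copy from the input sentence
        return source_pos_tags[node.source_copy_position]
    if node.target_copy_position is not None:        # copy of a preceding node
        return node_pos_tags[node.target_copy_position]
    return "UNK"                                      # new node from the vocabulary
```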

We do not include BERT embeddings in this layer because AMR nodes, especially their order, are significantly different from natural language text (on which BERT was pre-trained). We tried to use “fixed” BERT in this layer, which did not lead to improvement.[3]

[3] Limited by the GPU memory, we do not fine-tune BERT on this task and leave it for future work.

Decoder At each step $t$, the decoder (an $L$-layer unidirectional LSTM) receives the hidden state $s_t^{l-1}$ from the layer below and the hidden state $s_{t-1}^{l}$ from the previous time step, and generates the hidden state $s_t^{l}$:

$s_t^l = \mathrm{LSTM}^l(s_t^{l-1}, s_{t-1}^l)$

where $s_t^0$ is the concatenation (i.e., the input-feeding approach, Luong et al., 2015) of two vectors: the decoder embedding layer output for the previous node $u_{t-1}$ (while training, $u_{t-1}$ is the previous node of the reference node list; at test time it is the previous node emitted by the decoder), and the attentional vector $\tilde{s}_{t-1}$ from the previous step (explained later in this section). $s_0^l$ is the concatenation of the last encoder hidden states from the forward and backward LSTMs of layer $l$, respectively.
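A minimal single-layer sketch of the input-feeding step in PyTorch (sizes are illustrative; the decoder hidden size follows the Appendix):

```python
import torch
import torch.nn as nn

node_embed_dim, hidden_dim = 550, 1024        # illustrative sizes
cell = nn.LSTMCell(node_embed_dim + hidden_dim, hidden_dim)

def decoder_step(prev_node_embedding, prev_attentional_vector, state):
    """One decoding step: input feeding concatenates the previous node's
    embedding with the attentional vector from the previous time step."""
    x = torch.cat([prev_node_embedding, prev_attentional_vector], dim=-1)
    h, c = cell(x, state)                     # state is the (h, c) pair
    return h, (h, c)
```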

Source attention distribution $a_t^{src}$ is calculated by additive attention Bahdanau et al. (2014), and it is then used to produce a weighted sum of encoder hidden states, i.e., the context vector $c_t$.

Attentional vector $\tilde{s}_t$ combines both source and target side information; it is calculated by an MLP over the concatenation of the decoder hidden state $s_t^L$ and the context vector $c_t$ (shown in Figure 3).

The attentional vector $\tilde{s}_t$ has three usages:

(1) it is fed through a linear layer and softmax to produce the vocabulary distribution $P_{vocab}$;

(2) it is used to calculate the target attention distribution $a_t^{tgt}$ over the previously generated nodes;

(3) it is used to calculate the source-side copy probability $p_{src}$, the target-side copy probability $p_{tgt}$, and the generation probability $p_{gen}$ via a switch layer.

Note that $p_{src} + p_{tgt} + p_{gen} = 1$; they act as a soft switch to choose between copying an existing node from the preceding nodes by sampling from the target attention distribution $a_t^{tgt}$, or emitting a new node in two ways: (1) generating a new node from the fixed vocabulary by sampling from $P_{vocab}$, or (2) copying a word (as a new node) from the input sentence by sampling from the source attention distribution $a_t^{src}$.

The final probability distribution $P^{(node)}(u_t)$ for node $u_t$ is defined as follows. If $u_t$ is a copy of existing nodes, then:

$P^{(node)}(u_t) = p_{tgt} \sum_{i:\, u_i = u_t} a_t^{tgt}[i]$

otherwise:

$P^{(node)}(u_t) = p_{gen}\, P_{vocab}(u_t) + p_{src} \sum_{i:\, w_i = u_t} a_t^{src}[i]$

where $[i]$ indexes the $i$-th element of the corresponding distribution. Note that a new node $u_t$ may have the same surface form as an existing node; we track their difference using indices. The index $d_t$ for node $u_t$ is assigned deterministically: if $u_t$ is a target-side copy of an existing node $u_j$, then $d_t = d_j$; otherwise $u_t$ receives a new index that has not been used before.
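A minimal PyTorch sketch of how the three weighted distributions are summed into one distribution over an extended candidate space; the id mappings and candidate space are illustrative bookkeeping, not the paper's exact implementation:

```python
import torch

def final_node_distribution(p_gen, p_src, p_tgt,
                            vocab_dist, src_attn, tgt_attn,
                            src_candidate_ids, tgt_candidate_ids, num_candidates):
    """vocab_dist: (vocab_size,); src_attn: (src_len,); tgt_attn: (t-1,).
    src_candidate_ids / tgt_candidate_ids are LongTensors mapping each source
    word / preceding node to its id in a candidate space of size num_candidates."""
    dist = torch.zeros(num_candidates)
    dist[:vocab_dist.size(0)] += p_gen * vocab_dist          # generate from vocabulary
    dist.index_add_(0, src_candidate_ids, p_src * src_attn)  # copy a source word
    dist.index_add_(0, tgt_candidate_ids, p_tgt * tgt_attn)  # copy a preceding node
    return dist
```

Positions that share a candidate id accumulate their attention mass, which implements the sums in the equations above.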

4.2 Deep Biaffine Classifier

For the second stage (i.e., edge prediction), we employ a deep biaffine classifier, which was originally proposed for graph-based dependency parsing Dozat and Manning (2016), and recently has been applied to semantic parsing Peng et al. (2017a); Dozat and Manning (2018).

As depicted in Figure 5, the major difference in our usage is that instead of re-encoding AMR nodes, we directly use decoder hidden states from the extended pointer-generator network as the input to the deep biaffine classifier. We find two advantages of using decoder hidden states as input: (1) through the input-feeding approach, decoder hidden states contain contextualized information from both the input sentence and the predicted nodes; (2) because decoder hidden states are used for both node prediction and edge prediction, we can jointly train the two modules in our model.

Given decoder hidden states $s_1, \ldots, s_m$ (from the top decoder layer) and a learnt vector representation $s_0$ of a dummy root, we follow Dozat and Manning (2016), factorizing edge prediction into two components: one that predicts whether or not a directed edge exists between two nodes $u_i$ and $u_j$, and another that predicts the best label for each potential edge.

Edge and label scores are calculated as below:[4]

$\mathrm{score}^{(edge)}_{i,j} = \mathrm{Biaffine}\big(\mathrm{MLP}^{(edge\text{-}head)}(s_i),\ \mathrm{MLP}^{(edge\text{-}dep)}(s_j)\big)$

$\mathrm{score}^{(label)}_{i,j} = \mathrm{Bilinear}\big(\mathrm{MLP}^{(label\text{-}head)}(s_i),\ \mathrm{MLP}^{(label\text{-}dep)}(s_j)\big)$

Given a node $u_j$, the probability of $u_i$ being the edge head of $u_j$ is defined as:

$P^{(head)}(u_i \mid u_j) = \mathrm{softmax}_i\big(\mathrm{score}^{(edge)}_{i,j}\big)$

The edge label probability for edge $(u_i, u_j)$ is defined as:

$P^{(label)}(\ell \mid u_i, u_j) = \mathrm{softmax}_\ell\big(\mathrm{score}^{(label)}_{i,j}[\ell]\big)$

[4] Definitions of each function are provided in the Appendix.
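A minimal PyTorch sketch of the biaffine edge scorer over decoder hidden states; the activation, initialization, and exact parameterization are illustrative, with hidden sizes following the Appendix (1024-dimensional decoder states, 256-dimensional edge representations):

```python
import torch
import torch.nn as nn

class BiaffineEdgeScorer(nn.Module):
    """Score every (head, modifier) pair of nodes with a biaffine function."""
    def __init__(self, input_dim=1024, edge_dim=256):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(input_dim, edge_dim), nn.ELU())
        self.dep_mlp = nn.Sequential(nn.Linear(input_dim, edge_dim), nn.ELU())
        self.U = nn.Parameter(torch.zeros(edge_dim, edge_dim))
        self.w_head = nn.Parameter(torch.zeros(edge_dim))

    def forward(self, states):
        # states: (num_nodes, input_dim), including the dummy root at position 0.
        head = self.head_mlp(states)   # (num_nodes, edge_dim)
        dep = self.dep_mlp(states)     # (num_nodes, edge_dim)
        # scores[i, j]: score of node i being the head of node j.
        return head @ self.U @ dep.t() + (head @ self.w_head).unsqueeze(1)
```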

Figure 5: Deep biaffine classifier for edge prediction. Edge label prediction is not depicted in the figure.

4.3 Training

The training objective is to jointly minimize the loss of reference nodes and edges, which can be decomposed into the sum of the negative log likelihood at each decoding time step $t$ for (1) the reference node $u_t$, (2) the reference edge head $u_{h(t)}$ of node $u_t$, and (3) the reference edge label $\ell_t$ between $u_{h(t)}$ and $u_t$:

$\mathrm{loss} = \sum_{t=1}^{m} \Big[ -\log P^{(node)}(u_t) - \log P^{(head)}(u_{h(t)} \mid u_t) - \log P^{(label)}(\ell_t \mid u_{h(t)}, u_t) + \lambda\, \mathrm{covloss}_t \Big]$

$\mathrm{covloss}_t$ is a coverage loss to penalize repetitive nodes: $\mathrm{covloss}_t = \sum_i \min\big(a_t^{src}[i], \mathrm{cov}_t[i]\big)$, where $\mathrm{cov}_t$ is the sum of source attention distributions over all previous decoding time steps: $\mathrm{cov}_t = \sum_{k=1}^{t-1} a_k^{src}$, and $\lambda$ is the coverage loss weight. See See et al. (2017) for full details.
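A minimal sketch of the per-step coverage term, following See et al. (2017) (tensor shapes are illustrative):

```python
import torch

def coverage_loss(src_attention, coverage):
    """Penalize attending again to source positions already covered.

    src_attention: (src_len,) current source attention distribution
    coverage:      (src_len,) sum of source attention over previous steps
    """
    loss = torch.min(src_attention, coverage).sum()
    new_coverage = coverage + src_attention
    return loss, new_coverage
```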

4.4 Prediction

For node prediction, based on the final probability distribution $P^{(node)}(u_t)$ at each decoding time step, we implement both greedy search and beam search to sequentially decode a node list $u$ and indices $d$.

For edge prediction, given the predicted node list $u$, their indices $d$, and the edge scores, we apply the Chu-Liu-Edmonds algorithm with a simple adaptation to find the maximum spanning arborescence. As described in Algorithm 1, before calling the Chu-Liu-Edmonds algorithm, we first include a dummy root to ensure every node has a head, and then exclude edges whose source and destination nodes have the same indices, because these nodes will be merged into a single node to recover the standard AMR graph, where self-loops are invalid.

Input: nodes u_1, …, u_m; indices d_1, …, d_m; edge scores score(i, j)
Output: a maximum spanning arborescence.
// Include the dummy root u_0.
V ← {u_0, u_1, …, u_m};
E ← {(u_i, u_j) | u_i, u_j ∈ V, i ≠ j, j ≠ 0};
// Exclude invalid edges.
E ← E \ {(u_i, u_j) | d_i = d_j};
// Chu-Liu-Edmonds algorithm
return maxArborescence(V, E, score);
Algorithm 1: Chu-Liu-Edmonds algorithm with adaptation
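A minimal sketch of the adaptation as an edge-masking step; the masked score matrix would then be handed to any standard maximum spanning arborescence (Chu-Liu-Edmonds) solver, which is assumed rather than shown here:

```python
NEG_INF = float("-inf")

def mask_invalid_edges(edge_scores, indices):
    """Mask edges that Algorithm 1 excludes.

    edge_scores: (m+1) x (m+1) nested lists, row = head, column = modifier,
                 with position 0 reserved for the dummy root
    indices:     node indices d_1..d_m (positions 1..m in the matrix)
    """
    m = len(indices)
    scores = [row[:] for row in edge_scores]
    for i in range(m + 1):
        scores[i][0] = NEG_INF                       # nothing may head the dummy root
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if i == j or indices[i - 1] == indices[j - 1]:
                # Same-index nodes are merged later; an edge between them
                # would become a self-loop in the recovered AMR graph.
                scores[i][j] = NEG_INF
    return scores
```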

5 Related Work

AMR parsing approaches can be categorized into alignment-based, grammar-based, and attention-based approaches.

Alignment-based approaches were first explored by JAMR Flanigan et al. (2014), a pipeline of concept and relation identification with a graph-based algorithm. Zhou et al. (2016) improved this by jointly learning concept and relation identification with an incremental model. Both approaches rely on features based on alignments. Lyu and Titov (2018) treated alignments as latent variables in a joint probabilistic model, leading to a substantial reported improvement. Our approach requires no explicit alignments, but implicitly learns a source-side copy mechanism using attention.

Grammar-based approaches began with Wang et al. (2015, 2016), who incrementally transform dependency parses into AMRs using transition-based models, which was followed by a line of research, such as Puzikov et al. (2016); Brandt et al. (2016); Goodman et al. (2016); Damonte et al. (2017); Groschwitz et al. (2018). A pre-trained aligner, e.g., Pourdamghani et al. (2014); Liu et al. (2018), is needed for most parsers to generate training data (e.g., oracles for a transition-based parser). Artzi et al. (2015) leveraged CCGBank categories, and employed a grammar induction approach converting lambda-calculus logical forms into AMRs. Pust et al. (2015) recast AMR parsing as a machine translation problem, while also drawing features from external semantic resources. Our approach makes no significant use of external semantic resources,[5] and is aligner-free.

[5] We only use POS tags in the core parsing task. In post-processing, we use an entity linker as a common move for wikification, like van Noord and Bos (2017).

Attention-based parsing with Seq2Seq-style models has been considered Barzdins and Gosko (2016); Peng et al. (2017b), but is limited by the relatively small amount of labeled AMR data. Konstas et al. (2017) overcame this by making use of millions of unlabeled sentences through self-training, while van Noord and Bos (2017) showed significant gains via a character-level Seq2Seq model and a large amount of silver-standard AMR training data. In contrast, our approach, with the support of the extended pointer-generator network, can be effectively trained on the limited amount of labeled AMR data, with no data augmentation.

6 AMR Pre- and Post-processing

Anonymization is often used in AMR preprocessing to reduce sparsity (Werling et al., 2015; Peng et al., 2017b; Guo and Lu, 2018, inter alia). Similar to Konstas et al. (2017), we anonymize sub-graphs of named entities and other entities. Like Lyu and Titov (2018), we remove senses, and use Stanford CoreNLP Manning et al. (2014) to lemmatize input sentences and add POS tags.

In post-processing, we assign the most frequent sense for nodes (-01, if unseen) like Lyu and Titov (2018), and restore wiki links using the DBpedia Spotlight API Daiber et al. (2013) following Bjerva et al. (2016); van Noord and Bos (2017). We add polarity attributes based on the rules observed from the training data. More details of pre- and post-processing are provided in the Appendix.

7 Experiments

7.1 Setup

We conduct experiments on two AMR general releases (available to all LDC subscribers): AMR 2.0 (LDC2017T10) and AMR 1.0 (LDC2014T12). Our model is trained using ADAM Kingma and Ba (2014) for up to 120 epochs, with early stopping based on the development set. Full model training takes about 19 hours on AMR 2.0 and 7 hours on AMR 1.0, using two GeForce GTX TITAN X GPUs. At training, we have to fix BERT parameters due to the limited GPU memory. We leave fine-tuning BERT for future work. Hyper-parameter settings are provided in the Appendix.

7.2 Results

Corpus     Parser                         F1 (%)
AMR 2.0    Buys and Blunsom (2017)        61.9
           van Noord and Bos (2017)†      71.0
           Groschwitz et al. (2018)       71.0 ±0.5
           Lyu and Titov (2018)           74.4 ±0.2
           Ours                           76.3 ±0.1
AMR 1.0    Flanigan et al. (2016)         66.0
           Pust et al. (2015)             67.1
           Wang and Xue (2017)            68.1
           Guo and Lu (2018)              68.3 ±0.4
           Ours                           70.2 ±0.1

Table 1: Smatch scores on the test sets of AMR 2.0 and AMR 1.0. Standard deviation is computed over 3 runs with different random seeds. † indicates the previous best score from attention-based models.

Main Results We compare our approach against the previous best approaches and several recent competitors. Table 1 summarizes their Smatch scores Cai and Knight (2013) on the test sets of the two AMR general releases. On AMR 2.0, we significantly outperform the previous best approach Lyu and Titov (2018) by 1.9% F1; in particular, compared with the previous best attention-based approach van Noord and Bos (2017), our approach shows a substantial gain of 5.3% F1, with no usage of any silver-standard training data. On AMR 1.0, where there are only around 10k training instances, we still improve the best reported results by 1.9% F1.

Fine-grained Results In Table 2, we assess the quality of each subtask using the AMR-evaluation tools Damonte et al. (2017). We see a notable increase on reentrancies, which we attribute to target-side copy (based on our ablation studies in the next section). Significant increases are also shown on wikification and negation, indicating the benefits of using the DBpedia Spotlight API and negation detection rules in post-processing. On all other subtasks except named entities, our approach achieves results competitive with the previous best approach Lyu and Titov (2018), and outperforms the previous best attention-based approach van Noord and Bos (2017). The difference in scores on named entities is mainly caused by the anonymization methods used in preprocessing, which suggests a potential improvement by adapting the anonymization method presented in Lyu and Titov (2018) to our approach.

Metric         vN’17   G’18   L’18   Ours
Smatch         71      71     74     76.3 ±0.1
Unlabeled      74      74     77     79.0 ±0.1
No WSD         72      72     76     76.8 ±0.1
Reentrancies   52      49     52     60.0 ±0.1
Concepts       82      84     86     84.8 ±0.1
Named Ent.     79      78     86     77.9 ±0.2
Wikification   65      71     76     85.8 ±0.3
Negation       62      57     58     75.2 ±0.2
SRL            66      64     70     69.7 ±0.2

Table 2: Fine-grained F1 scores on the AMR 2.0 test set. vN’17 is van Noord and Bos (2017); G’18 is Groschwitz et al. (2018); L’18 is Lyu and Titov (2018).
Ablation                       AMR 1.0   AMR 2.0
Full model                     70.2      76.3
no source-side copy            62.7      70.9
no target-side copy            66.2      71.6
no coverage loss               68.5      74.5
no BERT embeddings             68.8      74.6
no index embeddings            68.5      75.5
no anonym. indicator embed.    68.9      75.6
no beam search                 69.2      75.3
no POS tag embeddings          69.2      75.7
no CharCNN features            70.0      75.8
only edge prediction           88.4      90.9

Table 3: Ablation studies on components of our model. (Scores are sorted by the delta from the full model.)

Ablation Study We consider the contributions of several model components in Table 3. The largest performance drop is from removing source-side copy,[6] showing its effectiveness at reducing sparsity from open-class vocabulary entries. Removing target-side copy also leads to a large drop. Specifically, the subtask score of reentrancies drops to 38.4% when target-side copy is disabled. Coverage loss is useful with regard to discouraging unnecessary repetitive nodes. In addition, our model benefits from input features such as language representations from BERT, index embeddings, POS tags, anonymization indicators, and character-level features from CharCNN. Note that without BERT embeddings, our model still outperforms the previous best approaches on both corpora. Beam search, commonly used in machine translation, is also helpful in our model. We provide side-by-side examples in the Appendix to further illustrate the contribution of each component; the contributions are largely intuitive, with the exception of BERT embeddings, whose exact qualitative contribution (before/after ablation) stands out less. Future work might consider a probing analysis with manually constructed examples, in the spirit of Linzen et al. (2016); Conneau et al. (2018); Tenney et al. (2019).

[6] All other hyper-parameter settings remain the same.

In the last row, we only evaluate model performance at the edge prediction stage by forcing our model to decode the reference nodes at the node prediction stage. The results indicate that if our model could make perfect predictions at the node prediction stage, the final Smatch score would be substantially higher, which identifies node prediction as the key to future improvement of our model.

Figure 6: Frequency, precision and recall of nodes from different sources, based on the AMR 2.0 test set.

Figure 6 shows the frequency of nodes from different sources, and their corresponding precision and recall based on our model predictions. Definitions of frequency, precision and recall are provided in the Appendix. Among all reference nodes, 43.8% are from vocabulary generation, 47.6% from source-side copy, and only 8.6% from target-side copy. On one hand, the highest frequency of source-side copy helps address sparsity and results in the highest precision and recall. On the other hand, we see space for improvement, especially on the relatively low recall of target-side copy, which is probably due to its low frequency.

Node Linearization As described in Section 3, we create the reference node list by a pre-order traversal over the gold AMR tree. As for the children of each node, we sort them in alphanumerical order. This linearization strategy has two advantages: (1) pre-order traversal guarantees that a head node (predicate) always comes before its children (arguments); (2) alphanumerical sorting orders children according to role ID (i.e., ARG0 > ARG1 > … > ARGn), following intuition from research in Thematic Hierarchies Fillmore (1968); Levin and Hovav (2005).
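A minimal sketch of this linearization, reusing the illustrative TreeNode class from the example in Section 2:

```python
def linearize(tree_node, nodes=None, indices=None):
    """Pre-order traversal with children sorted alphanumerically by relation,
    producing the reference node list and its indices."""
    if nodes is None:
        nodes, indices = [], []
    nodes.append(tree_node.concept)
    indices.append(tree_node.index)
    # Alphanumerical sort on relation names orders ARG0 < ARG1 < ... < ARGn.
    for relation, child in sorted(tree_node.children, key=lambda rc: rc[0]):
        linearize(child, nodes, indices)
    return nodes, indices
```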

Node Linearization          AMR 1.0   AMR 2.0
Pre-order + Alphanum        70.2      76.3
Pre-order + Alignment       61.9      68.3
Pure Alignment              64.3      71.3

Table 4: Smatch scores of full models trained and tested based on different node linearization strategies.

In Table 4, we report Smatch scores of full models trained and tested on data generated via our linearization strategy (Pre-order + Alphanum), compared to two obvious alternatives: the first still runs a pre-order traversal, but sorts the children of each node based on their alignments to input words; the second linearizes nodes purely based on alignments. Alignments are created using the tool by Pourdamghani et al. (2014). Clearly, our linearization strategy leads to much better results than the two alternatives. We also tried other traversal strategies, such as combining in-order traversal with alphanumerical sorting or alignment-based sorting, but did not get scores even comparable to the two alternatives.[7]

[7] van Noord and Bos (2017) also investigated linearization order, and found that alignment-based ordering yielded the best results under their setup, where AMR parsing is treated as a sequence-to-sequence learning problem.

8 Conclusion

We proposed an attention-based model for AMR parsing, introducing a series of novel components into a transductive setting that extend beyond what a typical NMT system would do on this task. Our model achieves state-of-the-art performance on two AMR corpora. For future work, we would like to extend our model to other semantic parsing tasks, such as MRS Copestake et al. (2005) and UCCA Abend and Rappoport (2013). We are also interested in semantic parsing in cross-lingual settings Zhang et al. (2018).

Appendix A Appendices

a.1 Average Pooling vs. Max Pooling

For BERT embeddings, we apply average pooling to the outputs (last-layer hidden states) of BERT in order to generate word-level embeddings for the input sentence. Table 5 shows scores of models using different pooling functions. Average pooling performs slightly better than max pooling.

Pooling            AMR 1.0      AMR 2.0
Average Pooling    70.2 ±0.1    76.3 ±0.1
Max Pooling        70.0 ±0.1    76.2 ±0.1

Table 5: Smatch scores based on different pooling functions. Standard deviation is over 3 runs on the test data.

a.2 Hyper-parameter Settings

Table 6 lists the hyper-parameters used in our full model. Both encoder and decoder embedding layers have GloVe and POS tag embeddings as well as CharCNN, but their parameters are not tied. We apply dropout (dropout_rate = 0.33) to the outputs of each module.

a.3 Node Prediction

Let all reference nodes from source $s$ be $R_s$, and let all predicted nodes from source $s$ be $P_s$, where a source is one of vocabulary generation, source-side copy, or target-side copy. In Figure 6, frequency, precision and recall of nodes from source $s$ are computed as below:

$\mathrm{frequency}(s) = \frac{|R_s|}{\sum_{s'} |R_{s'}|}, \quad \mathrm{precision}(s) = \frac{|P_s \cap R_s|}{|P_s|}, \quad \mathrm{recall}(s) = \frac{|P_s \cap R_s|}{|R_s|}$

GloVe embeddings
   source GloVe.840B.300d
   dim 300
BERT embeddings
   source BERT-Large-cased
   dim 1024
POS tag embeddings
   dim 100
Anonymization indicator embeddings
   dim 50
Index embeddings
   dim 50
CharCNN
   num_filters 100
   ngram_filter_sizes [3]
Encoder
   hidden_size 512
   num_layers 2
Decoder
   hidden_size 1024
   num_layers 2
Deep biaffine classifier
   edge_hidden_size 256
   label_hidden_size 128
Optimizer
   type ADAM
   learning_rate 0.001
   max_grad_norm 5.0
Coverage loss weight 1.0
Beam size 5
Vocabulary
   encoder_vocab_size (AMR 2.0) 18000
   decoder_vocab_size (AMR 2.0) 12200
   encoder_vocab_size (AMR 1.0) 9200
   decoder_vocab_size (AMR 1.0) 7300
Batch size 64
Table 6: Hyper-parameter settings

a.4 Side-by-Side Examples

We provide examples from the test set, with side-by-side comparisons between the full model prediction and the model prediction after ablation.

Figure 7: Full model prediction vs. no source-side copy prediction. Tokens in blue are copied from the source side. Without source-side copy, the prediction becomes totally different and inaccurate in this example.
Figure 8: Full model prediction vs. no target-side copy prediction. Nodes in blue denote the same concept (i.e., the country “China”). The full model correctly copies the first node (“vv7 / country”) as ARG0 of “start-01”. Without target-side copy, the model has to generate a new node with a different index, i.e., “vv10 / country”.
Figure 9: Full model prediction vs. no coverage loss prediction. The full model correctly predicts the second modifier “solemn”. Without coverage loss, the model generates a repetitive modifier “magnificent”.
Figure 10: Full model prediction vs. no BERT embeddings prediction.
Figure 11: An example AMR and the corresponding sentence before and after preprocessing. Senses are removed. The first named entity is replaced by “HIGHWAY_0”; the second named entity is replaced by “COUNTRY_REGION_0”; the first date entity is replaced by “DATE_0”.

a.5 AMR Pre- and Post-processing

First, we run Stanford CoreNLP, as presented by Lyu and Titov (2018), lemmatizing input sentences and adding NER and POS tags to each token. Second, we remove senses, wiki links and polarity attributes in AMR. Third, we anonymize sub-graphs of named entities and *-entity in a way similar to Konstas et al. (2017). Figure 11 shows an example before and after preprocessing. Sub-graphs of named entities are headed by one of AMR’s fine-grained entity types (e.g., highway, country_region in Figure 11) that contain a :name role. Sub-graphs of other entities are headed by their corresponding entity type name (e.g., date-entity in Figure 11). We replace these sub-graphs with a token of a special pattern “TYPE_i” (e.g., HIGHWAY_0, DATE_0 in Figure 11), where “TYPE” indicates the AMR entity type of the corresponding sub-graph, and “i” indicates that it is the i-th occurrence of that type. On the training set, we use simple rules to find mappings between anonymized sub-graphs and spans of text, and then replace mapped text with the anonymized tokens we inserted into the AMR graph. Additionally, we build a mapping from Stanford CoreNLP NER tags to AMR’s fine-grained types based on the training set, which is used in prediction. During prediction, we normalize test sentences to match our anonymized training data. For any entity span identified by Stanford CoreNLP, we replace it with an AMR entity type based on the mapping built during training. If no entry is found in the mapping, we replace entity spans with the coarse-grained NER tags from Stanford CoreNLP, which are also entity types in AMR.
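A minimal sketch of the text-side replacement, assuming entity spans and their AMR types have already been identified (the span format and helper are illustrative, not the actual preprocessing code):

```python
from collections import defaultdict

def anonymize_sentence(tokens, entity_spans):
    """Replace entity spans with "TYPE_i" tokens.

    entity_spans: list of (start, end, amr_type) with end exclusive,
                  e.g., [(2, 4, "COUNTRY_REGION"), (6, 7, "DATE")]
    """
    counts = defaultdict(int)
    out, pos = [], 0
    for start, end, amr_type in sorted(entity_spans):
        out.extend(tokens[pos:start])
        out.append(f"{amr_type}_{counts[amr_type]}")   # e.g., DATE_0, DATE_1
        counts[amr_type] += 1
        pos = end
    out.extend(tokens[pos:])
    return out
```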

In post-processing, we deterministically generate AMR sub-graphs for anonymizations using the corresponding text spans. We assign the most frequent sense for nodes (-01, if unseen) like Lyu and Titov (2018). We add wiki links to named entities using the DBpedia Spotlight API Daiber et al. (2013) following Bjerva et al. (2016); van Noord and Bos (2017), with the confidence threshold set to 0.5. We add polarity attributes based on Algorithm 2, where the four functions isNegation, modifiedWord, mappedNode, and addPolarity consist of simple rules observed from the training set. We use the PENMANCodec[8] to encode and decode both intermediate and final AMRs. The pre- and post-processing scripts will be released along with our model at http://url.

[8] https://github.com/goodmami/penman/

Input: sentence w, predicted AMR graph G
Output: AMR graph G with polarity attributes.
for each word w_i in w do
        if isNegation(w_i) then
               w_j ← modifiedWord(w_i, w);
               u ← mappedNode(w_j, G);
               addPolarity(u, G);
        end if
end for
return G;
Algorithm 2: Adding polarity attributes to AMR.

a.6 Deep Biaffine Classifier

MLP, Biaffine and Bilinear in the deep biaffine classifier are defined as below:
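A standard parameterization in the style of Dozat and Manning (2016) is sketched below; the exact form here is an assumption rather than necessarily the paper's own:

```latex
\begin{align*}
\mathrm{MLP}(x)             &= \mathrm{ELU}(W x + b) \\
\mathrm{Biaffine}(x_1, x_2) &= x_1^{\top} U x_2 + W [x_1; x_2] + b \\
\mathrm{Bilinear}(x_1, x_2) &= x_1^{\top} U x_2 + b
\end{align*}
```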
