Introduction
Graphical structures play an important role in natural language processing (NLP), often serving as the central formalism for representing syntax, semantics, and knowledge. For example, most syntactic representations (e.g., dependency relations) are tree-based, while most whole-sentence semantic representation frameworks (e.g., Abstract Meaning Representation (AMR) [3]) encode sentence meaning as directed acyclic graphs. A range of NLP applications can be framed as graph-to-sequence learning. For instance, text generation may involve realizing a semantic graph into a surface form [19], and syntactic machine translation incorporates source-side syntax information to improve translation quality [4]. Fig. 1 gives an example of AMR-to-text generation.
While early work uses statistical methods or neural models after linearizing the graph, graph neural networks (GNNs) have been firmly established as the state-of-the-art approaches for this task [9, 13]. GNNs typically compute the representation of each node iteratively based on those of its adjacent nodes. This inherently local propagation precludes efficient global communication, which becomes critical at larger graph sizes, where the distance between two nodes can exceed the number of stacked layers: for two nodes that are $n$ hops apart, at least $n$ layers are needed to capture their dependencies. Furthermore, even if two distant nodes are reachable, the information may be disrupted along the long journey [31, 13].
To address the above problems, we propose a new model, known as the Graph Transformer, which relies entirely on the multi-head attention mechanism [30] to draw global dependencies. (We note that the name Graph Transformer was also used in a recent work [16]; however, that model, like other graph neural networks, only attends over directly connected nodes.) In contrast to GNNs, the Graph Transformer allows the modeling of dependencies between any two nodes regardless of their distance in the input graph. One undesirable consequence is that it essentially treats any graph as fully connected, greatly diluting the explicit graph structure. To maintain a structure-aware view of the graph, our proposed model introduces explicit relation encodings and incorporates them into the pairwise attention score computation as dynamic parameters.
Our treatment of explicit relation encoding also brings other advantages over GNN-based methods. Previous state-of-the-art GNN-based methods use the Levi graph transformation [5, 13], where each labeled edge in the original graph is replaced by two unlabeled edges and a new node carrying the edge label. For example, in Fig. 1, the labeled edge want-01 →ARG1 girl turns into two unlabeled edges want-01 → ARG1 and ARG1 → girl. Since edge labels are represented as nodes, they end up sharing the same semantic space, which is not ideal because nodes and edges are typically different elements. In addition, the Levi graph transformation at least doubles the number of representation vectors, which introduces more complexity for the decoder-side attention mechanism [2] and copy mechanism [12, 23]. Through explicit and separate relation encoding, our proposed Graph Transformer inherently avoids these problems.
Experiments show that our model achieves better performance on graph-to-sequence learning tasks in natural language processing. For the AMR-to-text generation task, our model surpasses the current state-of-the-art neural methods trained on LDC2015E86 and LDC2017T10 by 1.6 and 2.2 BLEU points, respectively. For the syntax-based neural machine translation task, our model is also consistently better than others, including ensemble systems, showing its effectiveness on large training sets. In addition, we give an in-depth study of the source of the improvement and the internal workings of the proposed model.
Related Work
Early research efforts on graph-to-sequence learning use specialized grammar-based methods. Flanigan et al. [11] split input graphs into trees and use a tree-to-string transducer. Song et al. [27] recast generation as a traveling salesman problem. Jones et al. (2012) leverage hyperedge replacement grammar, and Song et al. [26] use a synchronous node replacement grammar. More recent work employs more general approaches, such as phrase-based machine translation models [22] and neural sequence-to-sequence methods [17] applied after linearizing the input graphs. Regarding AMR-to-text generation, Cao and Clark [6] propose an interesting idea that factorizes text generation through syntax. One limitation of sequence-to-sequence models, however, is that they require serialization of input graphs, which inevitably hinders the capture of graph structure information.
An emerging trend directly encodes the graph with different variants of graph neural networks, all of which stack multiple layers that restrict the update of each node representation to a first-order neighborhood but use different information passing schemes. Some borrow ideas from recurrent neural networks (RNNs): e.g., Beck et al. [5] use gated graph neural networks [18], while Song et al. [28] introduce LSTM-style information aggregation. Others apply convolutional neural networks (CNNs): e.g., Bastings et al. [4], Damonte and Cohen [9], and Guo et al. [13] utilize graph convolutional networks [15]. Koncel-Kedziorski et al. [16] update vertex information by attention over adjacent neighbors. Furthermore, Guo et al. [13] allow information exchange across different layers. Damonte and Cohen [9] systematically compare different encoders and show the advantages of graph encoders over tree and sequential ones. The contrast between our model and theirs is reminiscent of the contrast between the self-attention network (SAN) and CNNs/RNNs.
For sequence-to-sequence learning, the SAN-based Transformer model [30] has become the de facto approach owing to its empirical success. However, it remains unclear how it can be adapted to graphical data and how it would perform there. Our work is partially inspired by the introduction of relative position embeddings [25, 8] for sequential data. However, the extension to graphs is non-trivial, since we need to model much more complicated relations than mere linear distance. To the best of our knowledge, the Graph Transformer is the first graph-to-sequence transduction model relying entirely on self-attention to compute representations.
Background of Self-Attention Networks
The Transformer, introduced by Vaswani et al. [30], is a sequence-to-sequence neural architecture originally used for neural machine translation. It employs the self-attention network (SAN) to implement both the encoder and the decoder. The encoder consists of multiple identical blocks, the core of which is multi-head attention. Multi-head attention consists of $H$ attention heads, each of which learns a distinct attention function. Given a query vector $x$ and a set of context vectors $\{y_1, \ldots, y_n\}$ (in short, $Y$) of the same dimension $d$, each attention head transforms $x$ and $Y$ into distinct query, key, and value representations and computes the attention score as a dot-product:

$$e_i = (W^Q x)^\top (W^K y_i),$$

where $W^Q$ and $W^K$ are trainable projection matrices. The attention scores are scaled and normalized by a softmax function to compute the final attention output $z$:

$$\alpha = \operatorname{softmax}(e/\sqrt{d}), \qquad z = \sum_{i=1}^{n} \alpha_i \, W^V y_i,$$

where $\alpha$ is the attention vector (a distribution over all inputs $Y$) and $W^V$ is a trainable projection matrix. Finally, the outputs of all attention heads are concatenated and projected back to the original dimension of $x$, followed by feed-forward layers, residual connections, and layer normalization.
(We refer interested readers to Vaswani et al. [30] for more details.) For brevity, we denote the whole procedure described above as a single function $z = \operatorname{ATTN}(x, Y)$. For an input sequence $\{x_1, \ldots, x_n\}$, the SAN-based encoder computes the vector representations iteratively by $x_i^{l+1} = \operatorname{ATTN}(x_i^{l}, \{x_1^{l}, \ldots, x_n^{l}\})$, where $l$ indexes one of the $L$ blocks in total and the $x_i^{0}$ are word embeddings. In this way, a representation can build a direct relationship with another long-distance representation. To feed in the sequential order information, deterministic or learned position embeddings [30] are added to expose position information to the model; i.e., $x_i^{0}$ becomes the sum of the corresponding word embedding and the position embedding for position $i$.
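To make the $\operatorname{ATTN}(x, Y)$ abstraction concrete, here is a minimal PyTorch sketch of a single multi-head attention call; all names are illustrative, and the residual, feed-forward, and normalization wrapping is omitted.

```python
# A minimal sketch of ATTN(x, Y): one multi-head attention call.
# Illustrative only; not the authors' released implementation.
import torch
import torch.nn.functional as F

def attn(x, Y, Wq, Wk, Wv, Wo, num_heads):
    """x: (d,) query vector; Y: (n, d) context vectors; W*: (d, d) matrices."""
    d = x.size(0)
    dh = d // num_heads                        # per-head dimension
    q = (Wq @ x).view(num_heads, dh)           # (H, dh) queries
    K = (Y @ Wk.T).view(-1, num_heads, dh)     # (n, H, dh) keys
    V = (Y @ Wv.T).view(-1, num_heads, dh)     # (n, H, dh) values
    e = torch.einsum('hd,nhd->hn', q, K) / dh ** 0.5  # scaled dot-product scores
    alpha = F.softmax(e, dim=-1)               # attention distribution per head
    z = torch.einsum('hn,nhd->hd', alpha, V)   # weighted sum of values
    return Wo @ z.reshape(d)                   # concatenate heads, project back

# Usage: x = torch.randn(512); Y = torch.randn(10, 512)
# W = [torch.randn(512, 512) for _ in range(4)]; out = attn(x, Y, *W, num_heads=8)
```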
This treatment of sequential data by SAN bears a close resemblance to graph neural networks: the token sequence can be regarded as an unlabeled fully-connected graph (each token a node), and the multi-head attention mechanism as a specific message-passing scheme. This view of the relationship between SAN and graph neural networks inspires our work.
Graph Transformer
Overview
For a graph with $n$ nodes, previous graph neural networks compute the representation of each node as a function of the node itself and all its first-order neighbors. The graph structure is implicitly reflected by the receptive field of each node representation. This local communication design, however, can be inefficient for long-distance information exchange. We introduce a new model, the Graph Transformer, which offers a radically different paradigm that enables relation-aware global communication.
The overall framework is shown in Fig. 2. The most important characteristic of the Graph Transformer is that it takes a fully-connected view of arbitrary input graphs: a node can directly receive information from, and send information to, any other node, whether or not they are directly connected. These operations are achieved by our proposed extension of the original multi-head attention mechanism, the relation-enhanced global attention mechanism described below. Specifically, the relationship between any node pair is depicted as the shortest relation path between them. These pairwise relation paths are fed into a relation encoder to produce distributed relation encodings. We initialize node vectors as the sum of node embeddings and absolute position embeddings. Multiple blocks of the global attention network are stacked to compute the final node representations. At each block, a node vector is updated based on all other node vectors and the corresponding relation encodings. The resulting node vectors at the last block are fed to the sequence decoder for sequence generation.
Graph Encoder
Our graph encoder is responsible for transforming an input graph into a set of corresponding node embeddings. To apply global attention on a graph, the central problem is how to maintain the topological structure of the graph while allowing fully-connected communication. To this end, we propose the relation-enhanced global attention mechanism, an extension of the vanilla multi-head attention. Our idea is to incorporate an explicit relation representation between two nodes into their representation learning. Recall that, in standard multi-head attention, the attention score between element $x_i$ and element $x_j$ is simply the dot-product of the query vector of $x_i$ and the key vector of $x_j$:

$$e_{ij} = (W^Q x_i)^\top (W^K x_j). \qquad (1)$$
Suppose we have learned a vector representation $r_{ij}$ for the relationship between node $i$ and node $j$, which we will refer to as the relation encoding. Following the idea of relative position embeddings [25, 8], we propose to compute the attention score as follows:

$$e_{ij} = \big(W^Q (x_i + r_{i\to j})\big)^\top \big(W^K (x_j + r_{j\to i})\big), \qquad (2)$$

where we split the relation encoding $r_{ij}$ into a forward relation encoding $r_{i\to j}$ and a backward relation encoding $r_{j\to i}$, i.e., $r_{ij} = [r_{i\to j}; r_{j\to i}]$. Expanding Eq. (2), the attention score is computed from both the node representations and their relation representation as shown below:

$$e_{ij} = \underbrace{(W^Q x_i)^\top (W^K x_j)}_{(a)} + \underbrace{(W^Q x_i)^\top (W^K r_{j\to i})}_{(b)} + \underbrace{(W^Q r_{i\to j})^\top (W^K x_j)}_{(c)} + \underbrace{(W^Q r_{i\to j})^\top (W^K r_{j\to i})}_{(d)}. \qquad (3)$$
Each term in Eq. (3) has an intuitive meaning. Term (a) captures purely content-based addressing, which is the original term in the vanilla attention mechanism. Term (b) represents a source-dependent relation bias. Term (c) governs a target-dependent relation bias. Term (d) encodes a universal relation bias. Our formulation provides a principled way to model element-relation interactions. It has broader coverage than Shaw et al. [25] by the additional terms (c) and (d), and than Dai et al. [8] by the extra term (c). More importantly, previous methods only model relative position in the context of sequential data, where the relation encoding is merely a looked-up embedding of the relative offset. To depict the relation between two nodes in a graph, we instead use a shortest-path based approach, described in the next subsection.
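For concreteness, the following minimal sketch computes the four-term score of Eq. (3) for a single head; the per-head projection shapes and all names are assumptions for illustration (computing Eq. (2) directly is algebraically equivalent).

```python
# A sketch of the relation-enhanced attention score, Eq. (3).
import torch

def relation_enhanced_score(x_i, x_j, r_fwd, r_bwd, Wq, Wk):
    """x_i, x_j: (d,) node vectors; r_fwd = r_{i->j}, r_bwd = r_{j->i}: (d,);
    Wq, Wk: (dh, d) per-head projections."""
    q, k = Wq @ x_i, Wk @ x_j
    q_r, k_r = Wq @ r_fwd, Wk @ r_bwd
    a = q @ k      # (a) content-based addressing
    b = q @ k_r    # (b) source-dependent relation bias
    c = q_r @ k    # (c) target-dependent relation bias
    d = q_r @ k_r  # (d) universal relation bias
    return a + b + c + d  # scale by sqrt(dh) as in vanilla attention if desired
```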
Relation Encoder
Conceptually, the relation encoding gives the model global guidance about how information should be gathered and distributed, i.e., where to attend. For most graphical structures in NLP, an edge label conveys the direct relationship between adjacent nodes (e.g., the semantic role played between two concepts, or the dependency relation between two words). We extend this one-hop relation definition into multi-hop relation reasoning to characterize the relationship between two arbitrary nodes. For example, in Fig. 1, the shortest path from the concept want-01 to girl is "want-01 →ARG1 girl", which conveys that girl is the object of the wanted action. Intuitively, the shortest path between two nodes gives the closest and arguably the most important relationship between them. Therefore, we propose to use the shortest path (relation sequence) between two nodes to characterize their relationship. (When there are multiple shortest paths, we randomly sample one during training and take the averaged representation during testing.)
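Here is a minimal sketch of extracting a shortest relation path by breadth-first search; the adjacency format (a dict of labeled edges, including any reversed edges introduced later) is an assumption for illustration.

```python
# A sketch of shortest relation path extraction via BFS.
from collections import deque

def shortest_relation_path(adj, src, dst):
    """adj: {node: [(neighbor, edge_label), ...]}; returns a list of edge labels."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, labels = queue.popleft()
        if node == dst:
            return labels
        for nxt, lbl in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [lbl]))
    return None  # dst is unreachable from src

# e.g., adj = {'want-01': [('girl', 'ARG1')]}
# shortest_relation_path(adj, 'want-01', 'girl')  ->  ['ARG1']
```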
Given the sequential nature of the relation sequence, we employ recurrent neural networks with Gated Recurrent Units (GRU) [7] to transform a relation sequence into a distributed representation. Formally, we denote the shortest relation path between node $i$ and node $j$ as $sp_{i\to j} = [e(i, k_1), e(k_1, k_2), \ldots, e(k_m, j)]$, where $e(\cdot, \cdot)$ indicates the edge label and $k_{1:m}$ are the relay nodes. We employ bidirectional GRUs for sequence encoding:

$$\overrightarrow{h}_t = \operatorname{GRU}_f(\overrightarrow{h}_{t-1}, sp_t), \qquad \overleftarrow{h}_t = \operatorname{GRU}_b(\overleftarrow{h}_{t+1}, sp_t).$$

The last hidden states of the forward GRU network and the backward GRU network are concatenated to form the final relation encoding $r_{ij}$.
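A minimal PyTorch sketch of such a bidirectional GRU relation encoder, assuming padded batches of edge-label id sequences (packing of variable-length paths is omitted):

```python
# A sketch of the bidirectional GRU relation encoder.
import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    def __init__(self, num_labels, emb_dim=200, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(num_labels, emb_dim)  # edge-label embeddings
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, path_ids):
        """path_ids: (batch, path_len) label ids of shortest relation paths."""
        emb = self.embed(path_ids)   # (batch, path_len, emb_dim)
        _, h_n = self.gru(emb)       # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states; the two halves can
        # serve as the forward and backward relation encodings, respectively.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
```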
Bidirectionality
Though in theory our architecture can handle arbitrary input graphs, the most widely adopted graphs in real problems are directed acyclic graphs (DAGs), which implies that node information is propagated in one pre-specified direction. However, the reverse direction carries equally useful information. To facilitate communication in both directions, we add reverse edges to the graph: a reverse edge connects the same two nodes as the original edge but points in the opposite direction and carries a reversed label (e.g., for an original edge want-01 →ARG1 girl, we draw a virtual edge girl →ARG1-reversed want-01). For convenience, we also introduce a self-loop edge for each node. These extra edges have their own labels, and hence their own parameters in the network. Finally, we introduce an extra global node into every graph, which has a direct edge to every other node with a dedicated special label. The final representation of the global node serves as a representation of the whole graph.
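The augmentation can be sketched as follows; the label conventions ('-reversed', 'self', 'global') and the '&lt;global&gt;' node name are illustrative assumptions, not necessarily those of the released implementation.

```python
# A sketch of the graph augmentation: reversed edges, self-loops, global node.
def augment_graph(nodes, edges):
    """nodes: list of node ids; edges: list of (src, label, tgt) triples."""
    aug = list(edges)
    for src, lbl, tgt in edges:
        aug.append((tgt, lbl + '-reversed', src))  # reverse direction and label
    for n in nodes:
        aug.append((n, 'self', n))                 # self-loop edge per node
    for n in nodes:
        aug.append(('<global>', 'global', n))      # global node reaches all nodes
    return aug
```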
Absolute Position
Besides pairwise relationships, some absolute positional information can also be beneficial. For example, the root of an AMR graph serves as a rudimentary representation of the overall focus, so the minimum distance from the root node partially reflects the importance of the corresponding concept in the whole-sentence semantics. The sequential order of tokens in a dependency tree likewise provides information complementary to the dependency relations. To let the model make use of the absolute positions of nodes, we add positional embeddings to the input embeddings at the bottom of the encoder stack. For example, want-01 in Fig. 1 is the root node of the AMR graph, so its index is 0. Note that we denote the index of the global node as 0 as well.
Table 1: Dataset statistics.

| Dataset | #train | #dev | #test | #edge types | #node types | avg #nodes | avg #edges | avg diameter |
|---|---|---|---|---|---|---|---|---|
| LDC2015E86 | 16,833 | 1,368 | 1,371 | 113 | 18,735 | 17.34 | 17.53 | 6.98 |
| LDC2017T10 | 36,521 | 1,368 | 1,371 | 116 | 24,693 | 14.51 | 14.62 | 6.15 |
| English-Czech | 181,112 | 2,656 | 2,999 | 46 | 78,017 | 23.18 | 22.18 | 8.36 |
| English-German | 226,822 | 2,169 | 2,999 | 46 | 87,219 | 23.29 | 22.29 | 8.42 |
Sequence Decoder
Our sequence decoder follows the spirit of the sequential Transformer decoder: it yields the natural language sequence by computing a sequence of hidden states sequentially. One distinct characteristic is that we use the global graph representation to initialize the hidden state at each time step. The hidden state at each time step is then updated by interleaving multiple rounds of attention over the output of the encoder (node embeddings) and attention over previously generated tokens (token embeddings), both implemented with the multi-head attention mechanism. The global node is removed when performing the sequence-to-graph attention.
Copy Mechanism
To address the data sparsity issue in token prediction, we include a copy mechanism [12] in a similar spirit to previous work. Concretely, a single-head attention $\alpha^t$ is computed based on the decoder state $s_t$ and the node representations, where $\alpha^t_i$ denotes the attention weight of node $i$ at the current time step $t$. Our model can either directly copy the type name of a node or generate a token from a predefined vocabulary. Formally, the prediction probability of a token $w$ is given by:

$$P(w) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w) + p_{\mathrm{copy}} \sum_{i \in \mathcal{N}(w)} \alpha^t_i,$$

where $\mathcal{N}(w)$ is the set of nodes that have the same surface form as $w$ and $P_{\mathrm{vocab}}(w)$ is the ordinary generation probability over the vocabulary. The gates $p_{\mathrm{gen}}$ and $p_{\mathrm{copy}}$, with $p_{\mathrm{gen}} + p_{\mathrm{copy}} = 1$, are computed from $s_t$ by a single-layer neural network with softmax activation. The copy mechanism facilitates the generation of dates, numbers, and named entities in both the AMR-to-text generation and machine translation experiments.
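A minimal sketch of this computation, where gate_layer is assumed to be a single linear layer with two outputs (e.g., nn.Linear(d, 2)):

```python
# A sketch of the copy-augmented token distribution P(w).
import torch
import torch.nn.functional as F

def token_distribution(s_t, alpha, vocab_logits, gate_layer, node_to_token):
    """s_t: (d,) decoder state; alpha: (n,) attention weights over nodes;
    vocab_logits: (V,) generation scores; node_to_token: (n, V) 0/1 matrix
    mapping each node to its surface-form token."""
    p_gen, p_copy = F.softmax(gate_layer(s_t), dim=-1)  # gating, sums to 1
    p_vocab = F.softmax(vocab_logits, dim=-1)           # generation distribution
    p_copied = alpha @ node_to_token                    # sums alpha_i over nodes with surface w
    return p_gen * p_vocab + p_copy * p_copied          # final P(w)
```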
Table 2: Hyperparameter settings.

| Model component | Hyperparameter | Value |
|---|---|---|
| Char-level CNN | number of filters | 256 |
| | width of filters | 3 |
| | char embedding size | 32 |
| | final hidden size | 128 |
| Embeddings | node embedding size | 300 |
| | edge embedding size | 200 |
| | token embedding size | 300 |
| Multi-head attention | number of heads | 8 |
| | hidden state size | 512 |
| | feed-forward hidden size | 1024 |
Table 3: Main results on AMR-to-text generation (left block: LDC2015E86; right block: LDC2017T10).

| Model | BLEU | chrF++ | Meteor | BLEU | chrF++ | Meteor |
|---|---|---|---|---|---|---|
| Song et al. [27] | 22.4 | – | – | – | – | – |
| Flanigan et al. [11] | 23.0 | – | – | – | – | – |
| Pourdamghani et al. [22] | 26.9 | – | – | – | – | – |
| Song et al. [26] | 25.6 | – | – | – | – | – |
| Konstas et al. [17] | 22.0 | – | – | – | – | – |
| Cao and Clark [6] | 23.5 | – | – | 26.8 | – | – |
| Song et al. [28] | 23.3 | – | – | 24.9 | – | – |
| Beck et al. [5] | – | – | – | 23.3 | 50.4 | – |
| Damonte and Cohen [9] | 24.4 | – | 23.6 | 24.5 | – | 24.1 |
| Guo et al. [13] | 25.7 | 54.5 | 31.5 | 27.6 | 57.3 | 34.0 |
| Ours | 27.4 | 56.4 | 32.9 | 29.8 | 59.4 | 35.1 |
Table 4: Results on syntax-based machine translation.

| Model | Type | En-De BLEU | En-De chrF++ | En-Cz BLEU | En-Cz chrF++ |
|---|---|---|---|---|---|
| Bastings et al. [4] | Single | 16.1 | – | 9.6 | – |
| Beck et al. [5] | Single | 16.7 | 42.4 | 9.8 | 33.3 |
| Guo et al. [13] | Single | 19.0 | 44.1 | 12.1 | 37.1 |
| Beck et al. [5] | Ensemble | 19.6 | 45.1 | 11.7 | 35.9 |
| Guo et al. [13] | Ensemble | 20.5 | 45.8 | 13.1 | 37.8 |
| Ours | Single | 21.3 | 47.9 | 14.1 | 41.1 |
Experiments
We assess the effectiveness of our model on two typical graph-to-sequence learning tasks, namely AMR-to-text generation and syntax-based machine translation (MT). Following previous work, results are mainly evaluated by BLEU [20] and chrF++ [21]. Specifically, we use case-insensitive scores for AMR-to-text generation and case-sensitive BLEU scores for MT.
AMR-to-text Generation
Our first application is language generation from AMR, a semantic formalism that represents sentences as rooted DAGs [3]. For this AMR-to-text generation task, we use two benchmarks, the LDC2015E86 and LDC2017T10 datasets. The first block of Table 1 shows their statistics. Similar to Konstas et al. [17], we apply entity simplification and anonymization in preprocessing and restore the anonymized items in postprocessing.
The graph encoder uses randomly initialized node embeddings as well as the output of a learnable CNN over character embeddings. The sequence decoder uses randomly initialized token embeddings and another char-level CNN. Model hyperparameters were chosen with a small set of experiments on the development set of LDC2017T10; the detailed settings are listed in Table 2. During testing, we decode with beam search. To mitigate overfitting, we apply dropout [29] between different layers and randomly replace input node tags with a special UNK token. Parameter optimization is performed with the Adam optimizer [14], using the same learning rate schedule as Vaswani et al. [30]. (Code is available at https://github.com/jcyk/gtos.) For computational efficiency, we gather all distinct shortest paths in a training/testing batch and encode them into vector representations with the recurrent relation encoder described above; this strategy keeps the number of relation sequences to encode roughly stable when a large batch size is used.
We compare against systems that use neither ensembling nor additional silver data. The comparison methods fall into three categories: (1) feature-based statistical methods [27, 22, 26, 11]; (2) sequence-to-sequence neural models [17, 6], which take linearized graphs as input; and (3) recent work using different variants of graph neural networks to encode graph structure directly [28, 5, 9, 13]. The results are shown in Table 3. On both datasets, our approach substantially outperforms all previous methods. On LDC2015E86, our method achieves a BLEU score of 27.4, outperforming the previous best-performing neural model [13] (25.7 BLEU) by a large margin. Our model is also the first neural model to surpass the strong non-neural baseline established by Pourdamghani et al. [22]. It is worth noting that several of the traditional methods train their language models on the external Gigaword corpus and thus enjoy the advantage of extra data. On LDC2017T10, our model establishes a new record BLEU score of 29.8, improving over the state-of-the-art sequence-to-sequence model [6] by 3 points and the state-of-the-art GNN-based model [13] by 2.2 points. These results are even more remarkable given that the model of Cao and Clark [6] uses constituency syntax from an external parser. Similar trends hold for the additional metrics chrF++ and Meteor [10]. These results suggest that current graph neural networks cannot make full use of the AMR graph structure, and that our Graph Transformer provides a promising alternative.
Syntax-based Machine Translation
Our second evaluation is syntax-based machine translation, where the input is a dependency tree of the source language sentence and the output is a plain target language string. We employ the same data and settings as Bastings et al. [4], using both the English-German and the English-Czech datasets from the WMT16 translation task (http://www.statmt.org/wmt16/translation-task.html). The English sentences are tokenized and then parsed into dependency trees with SyntaxNet [1] (https://github.com/tensorflow/models/tree/master/syntaxnet). On the Czech and German sides, texts are tokenized with the Moses tokenizer (https://github.com/moses-smt/mosesdecoder). Byte-pair encoding [24] with 8,000 merge operations is used to obtain subwords. The second block of Table 1 shows the statistics of both datasets. For model configuration, we simply reuse the settings from our AMR-to-text experiments.
Table 4 presents the results alongside existing methods. On the English-to-German translation task, our model achieves a BLEU score of 21.3, outperforming all previously published single models by a large margin of 2.3 BLEU points. On the English-to-Czech translation task, our model also outperforms the best previously reported single model by an impressive margin of 2 BLEU points. In fact, our single model already outperforms previous state-of-the-art models that use ensembling. The advantage of our method is also confirmed by chrF++.
An important point about these experiments is that we did not tune the architecture: we simply employed the same model in all experiments, only adjusting the batch size for the different dataset sizes. We speculate that even better results could be obtained by tuning the architecture for individual tasks. Nevertheless, we still obtained improved performance over previous work, underlining the generality of our model.
More Analysis
The overall scores show a clear advantage of the Graph Transformer over existing methods, including the state-of-the-art GNN-based models. However, they do not shed light on how this advantage is achieved. To further reveal the source of the performance gain, we perform a series of analyses based on different characteristics of the input graphs. For these analyses, we use sentence-level chrF++ scores, taking their macro average where needed. All experiments are conducted on the test set of LDC2017T10.
Graph Size
To assess the model's performance on different graph sizes, we group graphs into four classes and show the chrF++ curves in Fig. 3, contrasting our model with the state-of-the-art GNN-based model of Guo et al. [13], denoted Guo'19. The performance of both models decreases as the graph size increases. This is expected, since a larger graph often contains more complex structure, and the interactions between graph elements are more difficult to capture. The gap between our model and Guo'19 widens for relatively large graphs, while for small graphs both models give similar performance. This result demonstrates that our model is better at handling complicated graphs. For extremely large graphs, the performance of both models drops clearly, yet ours remains slightly better.
Graph Diameter
We then study the impact of graph diameter (the length of the longest shortest path between any two nodes in the graph). Graphs with large diameters contain pairs of nodes that are distant from each other. We conjecture that this causes severe difficulties for GNN-based models, because they rely solely on local communication. Fig. 3 confirms our hypothesis: the curve of the GNN-based model shows a clear downward slope. In contrast, our model performs more stably, and the gap between the two curves further illustrates the superiority of our model in capturing long-distance dependencies.
Number of Reentrancies
We also study the ability to handle reentrancies, where the same node has multiple parent nodes (i.e., for AMR, the same concept participates in multiple relations). Recent work [9] has identified reentrancies as one of the most difficult aspects of AMR structure. We bin the number of reentrancies in a graph into four classes and plot the results in Fig. 3. The gap between the GNN-based model and the Graph Transformer becomes noticeably wide once a graph contains more than one reentrancy; from that point on, our model is consistently better than the GNN-based model, maintaining a clear chrF++ margin.
How Far Does Attention Look?
The Graph Transformer shows a strong capacity for processing complex and large graphs. We attribute this success to the global communication design, as it enables direct communication over long distances. A natural question is how well the model actually uses this property. To answer it, following Voita et al. (2019), we study the attention distribution of each attention head. Specifically, for each head we record the relation distance to which its maximum attention weight is assigned. Fig. 4 shows the attention distances averaged over the development set of LDC2017T10. We observe that nearly half of the attention heads look, on average, well beyond the first-order neighborhood, and the number of such long-distance heads generally increases in deeper layers. Interestingly, the longest-reaching head (layer 1, head 5) and the most short-sighted head (layer 1, head 2) coexist in the very first layer, the former with an average attention distance of over 5.
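This statistic can be computed as in the following sketch, assuming attention weights of shape (heads, n, n) and a precomputed pairwise shortest-path distance matrix.

```python
# A sketch of the per-head average attention distance.
import torch

def mean_max_attention_distance(attn, dist):
    """attn: (H, n, n) attention weights; dist: (n, n) relation distances."""
    target = attn.argmax(dim=-1)                  # (H, n): argmax target per node
    n = attn.size(-1)
    idx = torch.arange(n).unsqueeze(0).expand_as(target)
    d = dist[idx, target]                         # distance to the argmax target
    return d.float().mean(dim=-1)                 # average distance per head
```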
Conclusions
In this paper, we presented the Graph Transformer, the first graph-to-sequence learning model based entirely on self-attention. Unlike previous recurrent models, which require linearization of the input graph, and previous graph neural network models, which restrict message passing to the first-order neighborhood, our model enables global node-to-node communication. With the Graph Transformer, we achieve new state-of-the-art results on two typical graph-to-sequence generation tasks across four benchmark datasets.
References
[1] (2017) SyntaxNet models for the CoNLL 2017 shared task. arXiv preprint arXiv:1703.04929.
[2] (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
[3] (2013) Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 178–186.
[4] (2017) Graph convolutional encoders for syntax-aware neural machine translation. In EMNLP, pp. 1957–1967.
[5] (2018) Graph-to-sequence learning using gated graph neural networks. In ACL, pp. 273–283.
[6] (2019) Factorising AMR generation through syntax. In NAACL, pp. 2157–2163.
[7] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[8] (2019) Transformer-XL: attentive language models beyond a fixed-length context. In ACL, pp. 2978–2988.
[9] (2019) Structural neural encoders for AMR-to-text generation. In NAACL, pp. 3649–3658.
[10] (2014) Meteor Universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[11] (2016) Generation from Abstract Meaning Representation using tree transducers. In NAACL, pp. 731–739.
[12] (2016) Incorporating copying mechanism in sequence-to-sequence learning. In ACL, pp. 1631–1640.
[13] (2019) Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics 7, pp. 297–312.
[14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[16] (2019) Text generation from knowledge graphs with graph transformers. In NAACL, pp. 2284–2293.
[17] (2017) Neural AMR: sequence-to-sequence models for parsing and generation. In ACL, pp. 146–157.
[18] (2016) Gated graph sequence neural networks. In ICLR.
[19] (2015) Toward abstractive summarization using semantic representations. In NAACL, pp. 1077–1086.
[20] (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
[21] (2017) chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pp. 612–618.
[22] (2016) Generating English from Abstract Meaning Representations. In INLG, pp. 21–25.
[23] (2017) Get to the point: summarization with pointer-generator networks. In ACL, pp. 1073–1083.
[24] (2016) Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725.
[25] (2018) Self-attention with relative position representations. In NAACL, pp. 464–468.
[26] (2017) AMR-to-text generation with synchronous node replacement grammar. In ACL, pp. 7–13.
[27] (2016) AMR-to-text generation as a traveling salesman problem. In EMNLP, pp. 2084–2089.
[28] (2018) A graph-to-sequence model for AMR-to-text generation. In ACL, pp. 1616–1626.
[29] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), pp. 1929–1958.
[30] (2017) Attention is all you need. In NIPS, pp. 5998–6008.
[31] (2018) Graph2Seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823.