Graphical structures play an important role in natural language processing (NLP), where they often serve as the central formalism for representing syntax, semantics, and knowledge. For example, most syntactic representations (e.g., dependency relations) are tree-based, while most whole-sentence semantic representation frameworks (e.g., Abstract Meaning Representation (AMR)) encode sentence meaning as directed acyclic graphs. A range of NLP applications can be framed as graph-to-sequence learning. For instance, text generation may involve realizing a semantic graph into a surface form, and syntactic machine translation incorporates source-side syntax information to improve translation quality. Fig. 1 gives an example of AMR-to-text generation.
While early work used statistical methods or applied neural models to linearized graphs, graph neural networks (GNNs) have been firmly established as the state-of-the-art approach for this task [9, 13]. GNNs typically compute the representation of each node iteratively from those of its adjacent nodes. This inherently local propagation precludes efficient global communication, which becomes critical for larger graphs, where the distance between two nodes can exceed the number of stacked layers. For instance, for two nodes that are $n$ hops apart, at least $n$ layers are needed to capture their dependencies. Furthermore, even when two distant nodes are reachable, the information may be disrupted along the long journey [31, 13].
To address the above problems, we propose a new model, the Graph Transformer, which relies entirely on the multi-head attention mechanism to draw global dependencies. (We note that the name Graph Transformer was used in a recent work; however, that model only considers relations between directly connected nodes, as other graph neural networks do.) Different from GNNs, the Graph Transformer allows modeling of dependencies between any two nodes regardless of their distance in the input graph. One undesirable consequence is that it essentially treats any graph as fully connected, greatly diluting the explicit graph structure. To maintain a structure-aware view of the graph, our proposed model introduces explicit relation encodings and incorporates them into the pairwise attention score computation as dynamic parameters.
Our treatment of explicit relation encoding also brings side advantages over GNN-based methods. Previous state-of-the-art GNN-based methods use the Levi graph transformation [5, 13], in which each labeled edge of the original graph is replaced by two unlabeled edges passing through a new node that carries the edge label (e.g., the labeled edge in Fig. 1 becomes two unlabeled edges). Since edge labels are represented as nodes, they end up sharing the same semantic space, which is not ideal because nodes and edges are typically different elements. In addition, the Levi graph transformation at least doubles the number of representation vectors, which introduces more complexity for the decoder-side attention and copy mechanisms [12, 23]. Through explicit and separate relation encoding, our proposed Graph Transformer inherently avoids these problems.
Experiments show that our model achieves better performance on graph-to-sequence learning tasks for natural language processing. For the AMR-to-text generation task, our model surpasses the current state-of-the-art neural methods trained on LDC2015E86 and LDC2017T10 by 1.6 and 2.2 BLEU points, respectively. For the syntax-based neural machine translation task, our model is also consistently better than others, including ensemble systems, showing its effectiveness on large training sets. In addition, we give an in-depth study of the sources of the performance gain and the internal workings of the proposed model.
Early research efforts for graph-to-sequence learning use specialized grammar-based methods. Flanigan et al. (2016) split input graphs into trees and use a tree-to-string transducer. Song et al. (2016) recast generation as a traveling salesman problem. Jones et al. (2012) leverage hyperedge replacement grammar, and Song et al. (2017) use a synchronous node replacement grammar. More recent work employs more general approaches, such as phrase-based machine translation models and neural sequence-to-sequence methods applied to linearized input graphs. Regarding AMR-to-text generation, Cao and Clark (2019) propose an interesting idea that factorizes text generation through syntax. One limitation of sequence-to-sequence models, however, is that they require serialization of input graphs, which inevitably makes it harder to capture graph structure information.
An emerging trend directly encodes the graph with different variants of graph neural networks, which all stack multiple layers that restrict the update of each node representation to its first-order neighborhood but use different information passing schemes. Some borrow ideas from recurrent neural networks (RNNs): e.g., Beck et al. (2018) use gated graph neural networks, while Song et al. (2018) introduce LSTM-style information aggregation. Others apply convolutional neural networks (CNNs): e.g., Bastings et al. (2017), Damonte and Cohen (2019), and Guo et al. (2019) utilize graph convolutional networks. Koncel-Kedziorski et al. (2019) update vertex information by attending over adjacent neighbors. Furthermore, Guo et al. (2019) allow information exchange across different layers, and Damonte and Cohen (2019) systematically compare different encoders, showing the advantages of graph encoders over tree and sequential ones. The contrast between our model and theirs is reminiscent of the contrast between the self-attention network (SAN) and CNNs/RNNs.
For sequence-to-sequence learning, the SAN-based Transformer model has become the de facto approach owing to its empirical success. However, it remains unclear how to adapt it to graph-structured data and how well it would perform there. Our work is partially inspired by the introduction of relative position embeddings [25, 8] for sequential data. However, the extension to graphs is nontrivial, since we need to model far more complicated relations than mere relative distance. To the best of our knowledge, the Graph Transformer is the first graph-to-sequence transduction model that relies entirely on self-attention to compute representations.
Background of Self-Attention Network
The Transformer introduced by Vaswani et al. (2017) is a sequence-to-sequence neural architecture originally used for neural machine translation. It employs self-attention networks (SAN) for implementing both the encoder and the decoder. The encoder consists of multiple identical blocks, the core of which is multi-head attention. The multi-head attention consists of $h$ attention heads, each of which learns a distinct attention function. Given a query vector $x$ and a set of context vectors $z_1, \ldots, z_n$ with the same dimension $d$ ($Z$ in short), each attention head transforms $x$ and every $z_j$ into distinct query and key representations and computes the attention score as the dot-product between them:

$$s_j = (x W_Q)(z_j W_K)^\top,$$

where $W_Q, W_K$ are trainable projection matrices. The attention scores are scaled and normalized by a softmax function to compute the final attention output $o$:

$$\alpha = \operatorname{softmax}\!\left(\frac{s}{\sqrt{d}}\right), \qquad o = \Big(\sum_j \alpha_j z_j\Big) W_V,$$

where $\alpha$ is the attention vector (a distribution over all inputs $z_1, \ldots, z_n$) and $W_V$ is a trainable projection matrix. Finally, the outputs of all attention heads are concatenated and projected back to the original dimension of $x$, followed by feed-forward layers, residual connections, and layer normalization. (We refer interested readers to Vaswani et al. (2017) for more details.) For brevity, we denote the whole procedure described above as a single function $o = \operatorname{ATT}(x, Z)$.
For an input sequence $x_1, \ldots, x_n$, the SAN-based encoder computes the vector representations iteratively, block by block, by $x_i^{l+1} = \operatorname{ATT}(x_i^l, \{x_1^l, \ldots, x_n^l\})$ for $l = 0, \ldots, L-1$, where $L$ is the total number of blocks and $x_i^0$ are word embeddings. In this way, a representation can build a direct relationship with another long-distance representation. To feed in the sequential order information, deterministic or learned position embeddings are introduced to expose position information to the model, i.e., $x_i^0$ becomes the sum of the corresponding word embedding and the position embedding for position $i$.
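A minimal numpy sketch of a single attention head as described above (a toy illustration under our own naming, not the paper's code):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_head(x, Z, Wq, Wk, Wv):
    """One attention head: query vector x attends over context vectors Z."""
    q = x @ Wq                            # query representation of x
    K = Z @ Wk                            # key representations of Z
    s = K @ q                             # dot-product attention scores
    a = softmax(s / np.sqrt(x.shape[0]))  # scaled and normalized weights
    return a @ (Z @ Wv)                   # weighted sum of value representations

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
Z = rng.standard_normal((5, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
o = attention_head(x, Z, Wq, Wk, Wv)
assert o.shape == (d,)
```

In the full model, several such heads run in parallel and their outputs are concatenated and projected back to dimension $d$.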
The above treatment of sequential data by SAN bears a close resemblance to graph neural networks: the token sequence can be regarded as an unlabeled fully-connected graph (with each token as a node), and the multi-head attention mechanism as a specific message-passing scheme. This view of the relationship between SAN and graph neural networks inspires our work.
For a graph with nodes $v_1, \ldots, v_n$, previous graph neural networks compute each node representation as a function of the input node and all of its first-order neighbors. The graph structure is implicitly reflected by the receptive field of each node representation. This local communication design, however, can be inefficient for long-distance information exchange. We introduce a new model, the Graph Transformer, which provides an aggressively different paradigm that enables relation-aware global communication.
The overall framework is shown in Fig. 2. The most important characteristic of the Graph Transformer is that it takes a fully-connected view of arbitrary input graphs: a node can directly receive information from, and send information to, any other node, whether or not they are directly connected. These operations are achieved by our proposed extension of the original multi-head attention mechanism, the relation-enhanced global attention mechanism described below. Specifically, the relationship between any node pair is depicted as the shortest relation path between them. These pairwise relation paths are fed into a relation encoder to produce distributed relation encodings. We initialize node vectors as the sum of node embeddings and absolute position embeddings. Multiple blocks of the global attention network are stacked to compute the final node representations; at each block, a node vector is updated based on all other node vectors and the corresponding relation encodings. The resulting node vectors at the last block are fed to the sequence decoder for sequence generation.
Our graph encoder is responsible for transforming an input graph into a set of corresponding node embeddings. To apply global attention on a graph, the central problem is how to maintain the topological structure of the graph while allowing fully-connected communication. To this end, we propose the relation-enhanced global attention mechanism, an extension of the vanilla multi-head attention. Our idea is to incorporate an explicit relation representation between two nodes into their representation learning. Recall that, in the standard multi-head attention, the attention score between element $x_i$ and element $x_j$ is simply the dot-product of the query vector of $x_i$ and the key vector of $x_j$:

$$e_{ij} = (x_i W_Q)(x_j W_K)^\top. \tag{1}$$
Suppose we have learned a vector representation $r_{ij}$, which we will refer to as the relation encoding, for the relationship between node $v_i$ and node $v_j$. Following the idea of relative position embedding [25, 8], we propose to compute the attention score as follows:

$$e_{ij} = \big((x_i + r_{i\to j}) W_Q\big)\big((x_j + r_{j\to i}) W_K\big)^\top, \tag{2}$$

where we split the relation encoding $r_{ij}$ into the forward relation encoding $r_{i\to j}$ and the backward relation encoding $r_{j\to i}$. The attention score is thus computed from both the node representations and their relation representation, and can be expanded as:

$$e_{ij} = \underbrace{x_i W_Q W_K^\top x_j^\top}_{(a)} + \underbrace{x_i W_Q W_K^\top r_{j\to i}^\top}_{(b)} + \underbrace{r_{i\to j} W_Q W_K^\top x_j^\top}_{(c)} + \underbrace{r_{i\to j} W_Q W_K^\top r_{j\to i}^\top}_{(d)}. \tag{3}$$
Each term in Eq. (3) has an intuitive meaning: term (a) captures purely content-based addressing, which is the original term in the vanilla attention mechanism; term (b) represents a source-dependent relation bias; term (c) governs a target-dependent relation bias; and term (d) encodes a universal relation bias. Our formalization provides a principled way to model element-relation interactions. In comparison, it has broader coverage than Shaw et al. (2018) through the additional terms (c) and (d), and than Dai et al. (2019) through the extra term (c). More importantly, previous methods only model relative position in the context of sequential data, merely looking up embeddings of the relative positions. To depict the relation between two nodes in a graph, we instead utilize a shortest-path based approach as described below.
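The four-term decomposition can be checked numerically. The sketch below (our own illustration, with random toy tensors) verifies that terms (a)-(d) sum to the relation-augmented dot-product:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
xi, xj = rng.standard_normal(d), rng.standard_normal(d)        # node representations
r_fwd, r_bwd = rng.standard_normal(d), rng.standard_normal(d)  # forward/backward relation encodings
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Relation-enhanced score: queries/keys built from node + relation encoding.
direct = ((xi + r_fwd) @ Wq) @ ((xj + r_bwd) @ Wk)

M = Wq @ Wk.T
a = xi @ M @ xj         # (a) content-based addressing
b = xi @ M @ r_bwd      # (b) source-dependent relation bias
c = r_fwd @ M @ xj      # (c) target-dependent relation bias
u = r_fwd @ M @ r_bwd   # (d) universal relation bias

assert np.isclose(direct, a + b + c + u)
```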
Conceptually, the relation encoding gives the model global guidance about how information should be gathered and distributed, i.e., where to attend. For most graphical structures in NLP, the edge label conveys a direct relationship between adjacent nodes (e.g., the semantic role played between concepts, or the dependency relation between two words). We extend this one-hop relation definition to multi-hop relation reasoning for characterizing the relationship between two arbitrary nodes. For example, in Fig. 1, the shortest path from the concept want-01 to girl conveys that girl is the object of the wanted action. Intuitively, the shortest path between two nodes gives the closest and arguably the most important relationship between them. Therefore, we propose to use the shortest path (relation sequence) between two nodes to characterize their relationship. (When there are multiple shortest paths, we randomly sample one during training and take the averaged representation during testing.)
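Extracting such a shortest relation path amounts to a BFS over the labeled graph. A sketch, with a hypothetical edge-triple format and a toy AMR-like fragment (not the exact graph from Fig. 1):

```python
from collections import deque

def shortest_relation_path(edges, src, dst):
    """BFS over labeled edges; returns the label sequence of one shortest path.
    edges: list of (head, label, tail) triples."""
    adj = {}
    for h, lab, t in edges:
        adj.setdefault(h, []).append((lab, t))
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for lab, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [lab]))
    return None  # dst unreachable from src

# toy fragment: want-01 -ARG1-> go-01 -ARG0-> girl
edges = [("want-01", "ARG1", "go-01"), ("go-01", "ARG0", "girl")]
print(shortest_relation_path(edges, "want-01", "girl"))  # ['ARG1', 'ARG0']
```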
Following the sequential nature of the relation sequence, we employ recurrent neural networks with Gated Recurrent Units (GRU) to transform a relation sequence into a distributed representation. Formally, we represent the shortest relation path between node $v_i$ and node $v_j$ as $sp_{i\to j} = [e(i, k_1), e(k_1, k_2), \ldots, e(k_m, j)]$, where $e(\cdot, \cdot)$ indicates the edge label and $k_1, \ldots, k_m$ are the relay nodes. We employ bi-directional GRUs for sequence encoding:

$$\overrightarrow{h}_t = \operatorname{GRU}_f(\overrightarrow{h}_{t-1}, sp_t), \qquad \overleftarrow{h}_t = \operatorname{GRU}_b(\overleftarrow{h}_{t+1}, sp_t).$$

The last hidden states of the forward and backward GRU networks are concatenated to form the final relation encoding $r_{ij}$.
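A numpy sketch of the bi-directional GRU relation encoder, with illustrative dimensions and randomly initialized parameters (not the trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    """A minimal GRU that consumes a sequence of input vectors."""
    def __init__(self, d_in, d_h, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((3, d_in, d_h)) * 0.1  # input weights (z, r, h)
        self.U = rng.standard_normal((3, d_h, d_h)) * 0.1   # recurrent weights

    def run(self, seq):
        h = np.zeros(self.U.shape[1])
        for x in seq:
            z = sigmoid(x @ self.W[0] + h @ self.U[0])            # update gate
            r = sigmoid(x @ self.W[1] + h @ self.U[1])            # reset gate
            h_tilde = np.tanh(x @ self.W[2] + (r * h) @ self.U[2])
            h = (1 - z) * h + z * h_tilde
        return h  # last hidden state

def relation_encoding(label_vecs, fwd, bwd):
    """Concatenate the last states of the forward and backward GRUs."""
    return np.concatenate([fwd.run(label_vecs), bwd.run(label_vecs[::-1])])

d_in, d_h = 200, 64
path = [np.random.default_rng(i).standard_normal(d_in) for i in range(3)]
r = relation_encoding(path, GRU(d_in, d_h, 0), GRU(d_in, d_h, 1))
assert r.shape == (2 * d_h,)
```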
Though in theory our architecture can handle arbitrary input graphs, the most widely adopted graphs in real problems are directed acyclic graphs (DAGs), meaning that node information is propagated in one pre-specified direction. However, the reverse direction carries equally useful information. To facilitate communication in both directions, we add reverse edges to the graph: a reverse edge connects the same two nodes as the original edge, but in the opposite direction and with a reversed label. For example, for each original labeled edge, we draw a virtual edge from its tail to its head with the reversed label. For convenience, we also introduce self-loop edges for each node. These extra edges have specific labels, hence their own parameters in the network. We further introduce an extra global node into every graph, which has a direct edge to all other nodes with a special label. The final representation of the global node serves as the representation of the whole graph.
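The graph augmentation described above can be sketched as follows; the label conventions (`-reversed`, `self`, `global`, `<global>`) are our own illustration, not the paper's exact vocabulary:

```python
def augment(nodes, edges):
    """edges: list of (head, label, tail) triples. Returns the augmented edge list."""
    out = list(edges)
    # a reverse edge for each original edge, carrying a reversed label
    out += [(t, lab + "-reversed", h) for h, lab, t in edges]
    # a self-loop for every node
    out += [(n, "self", n) for n in nodes]
    # a global node directly connected to every other node
    out += [("<global>", "global", n) for n in nodes]
    return out

nodes = ["want-01", "girl"]
edges = [("want-01", "ARG1", "girl")]
aug = augment(nodes, edges)
assert ("girl", "ARG1-reversed", "want-01") in aug
assert len(aug) == 1 + 1 + 2 + 2  # original + reverse + self-loops + global edges
```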
Besides pairwise relationships, some absolute positional information can also be beneficial. For example, the root of an AMR graph serves as a rudimentary representation of the overall focus, so the minimum distance from the root node partially reflects the importance of the corresponding concept in the whole-sentence semantics. The sequential order of tokens in a dependency tree likewise provides information complementary to the dependency relations. To let the model make use of the absolute positions of nodes, we add positional embeddings to the input embeddings at the bottom of the encoder stack. For example, want-01 in Fig. 1 is the root node of the AMR graph, so its index should be 0. Note that we denote the index of the global node as 0 as well.
[Table 1: dataset statistics — #train, #dev, #test, #edge types, #node types, avg. #nodes, avg. #edges, avg. diameter.]
Our sequence decoder follows the spirit of the sequential Transformer decoder. It yields the natural language sequence by calculating a sequence of hidden states sequentially. One distinct characteristic is that we use the global graph representation to initialize the hidden state at each time step. The hidden state at each time step is then updated by interleaving multiple rounds of attention over the output of the encoder (node embeddings) and attention over previously-generated tokens (token embeddings), both implemented with the multi-head attention mechanism. The global node is removed when performing the sequence-to-graph attention.
To address the data sparsity issue in token prediction, we include a copy mechanism in a similar spirit to previous works. Concretely, a single-head attention $\beta^t$ is computed based on the decoder state $s_t$ and the node representations, where $\beta_i^t$ denotes the attention weight of node $v_i$ at the current time step $t$. Our model can either directly copy the type name of a node or generate from a pre-defined vocabulary. Formally, the prediction probability of a token $y$ is given by:

$$P(y) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(y) + p_{\mathrm{copy}} \sum_{i \in N(y)} \beta_i^t,$$

where $N(y)$ is the set of nodes that have the same surface form as $y$, and the gates $p_{\mathrm{gen}}$ and $p_{\mathrm{copy}}$ are computed from the decoder state by a single-layer neural network with softmax activation, so that $p_{\mathrm{gen}} + p_{\mathrm{copy}} = 1$. The copy mechanism facilitates the generation of dates, numbers, and named entities in both the AMR-to-text generation and machine translation experiments.
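A sketch of the combined token distribution; for illustration the gate is a fixed pair summing to 1 rather than the learned network, and all names are our own:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def token_prob(y, p_vocab, vocab, beta, node_names, gate):
    """P(y) = p_gen * P_vocab(y) + p_copy * (attention mass on nodes named y)."""
    p_gen, p_copy = gate  # assumed output of a softmax gate, sums to 1
    gen = p_vocab[vocab.index(y)] if y in vocab else 0.0
    copy = sum(b for b, n in zip(beta, node_names) if n == y)
    return p_gen * gen + p_copy * copy

vocab = ["the", "wants", "girl"]
p_vocab = softmax(np.array([0.2, 1.0, 0.5]))  # decoder vocabulary distribution
beta = softmax(np.array([2.0, 0.1]))          # copy attention over graph nodes
node_names = ["girl", "want-01"]
gate = (0.7, 0.3)

# the mixture is a valid distribution over vocabulary tokens plus copyable names
total = sum(token_prob(y, p_vocab, vocab, beta, node_names, gate)
            for y in set(vocab) | set(node_names))
assert abs(total - 1.0) < 1e-9
```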
[Table 2: hyperparameter settings.]

| Component | Hyperparameter | Value |
|---|---|---|
| char-level CNN | number of filters | 256 |
| | width of filters | 3 |
| | char embedding size | 32 |
| | final hidden size | 128 |
| Embeddings | node embedding size | 300 |
| | edge embedding size | 200 |
| | token embedding size | 300 |
| Multi-head attention | number of heads | 8 |
| | hidden state size | 512 |
| | feed-forward hidden size | 1024 |
We assess the effectiveness of our models on two typical graph-to-sequence learning tasks, namely AMR-to-text generation and syntax-based machine translation (MT). Following previous work, results are mainly evaluated with BLEU and chrF++. Specifically, we use case-insensitive scores for AMR and case-sensitive BLEU scores for MT.
Our first application is language generation from AMR, a semantic formalism that represents sentences as rooted DAGs. For this AMR-to-text generation task, we use two benchmarks, the LDC2015E86 dataset and the LDC2017T10 dataset. The first block of Table 1 shows the statistics of the two datasets. Similar to Konstas et al. (2017), we apply entity simplification and anonymization in the preprocessing steps and restore the entities in postprocessing.
The graph encoder uses randomly initialized node embeddings as well as the output of a learnable CNN over character embeddings. The sequence decoder uses randomly initialized token embeddings and another char-level CNN. Model hyperparameters are chosen by a small set of experiments on the development set of LDC2017T10; the detailed settings are listed in Table 2. During testing, we use beam search for generation. To mitigate overfitting, we apply dropout between different layers and replace input node tags with a special UNK token at a fixed rate. Parameter optimization is performed with the Adam optimizer, and the learning rate schedule of Vaswani et al. (2017) is adopted in our experiments. (Code available at https://github.com/jcyk/gtos.) For computational efficiency, we gather all distinct shortest paths in a training/testing batch and encode them into vector representations by the recurrent relation encoding procedure described above. (This strategy greatly reduces the number of relation sequences to encode, to a stable number when a large batch size is used.)
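The batching trick in the last sentence — encode each distinct shortest path only once per batch and scatter the result — can be sketched as (hypothetical helper names; `fake_encode` stands in for the GRU relation encoder):

```python
def encode_batch_paths(all_paths, encode):
    """all_paths: list of label-sequence tuples, one per node pair in the batch.
    Encodes each distinct path once, then scatters results back to every pair."""
    distinct = sorted(set(all_paths))
    cache = {p: encode(p) for p in distinct}   # one encoder call per distinct path
    return [cache[p] for p in all_paths], len(distinct)

calls = []
def fake_encode(path):
    calls.append(path)
    return len(path)  # stand-in for the GRU relation encoder output

paths = [("ARG1",), ("ARG1", "ARG0"), ("ARG1",), ("ARG1",)]
encodings, n_distinct = encode_batch_paths(paths, fake_encode)
assert n_distinct == 2 and len(calls) == 2  # 4 pairs, only 2 encoder calls
assert encodings == [1, 2, 1, 1]
```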
We compare against systems that use neither ensembling nor additional silver data. The comparison methods can be grouped into three categories: (1) feature-based statistical methods [27, 22, 26, 11]; (2) sequence-to-sequence neural models [17, 6], which use linearized graphs as inputs; and (3) recent works using different variants of graph neural networks to encode graph structures directly [28, 5, 9, 13]. The results are shown in Table 3. For both datasets, our approach substantially outperforms all previous methods. On the LDC2015E86 dataset, our method achieves a BLEU score of 27.4, outperforming the previous best-performing neural model by a large margin of 2.6 BLEU points. Our model also becomes the first neural model to surpass the strong non-neural baseline established by Pourdamghani et al. (2016). It is worth noting that the traditional methods train their language models on the external Gigaword corpus, and thus possess the additional advantage of extra data. On the LDC2017T10 dataset, our model establishes a new record BLEU score of 29.8, improving over the state-of-the-art sequence-to-sequence model by 3 points and the state-of-the-art GNN-based model by 2.2 points. The results are all the more remarkable since the model of Cao and Clark (2019) uses constituency syntax from an external parser. Similar trends hold for the additional metrics chrF++ and Meteor. These results suggest that current graph neural networks cannot make full use of the AMR graph structure, and that our Graph Transformer provides a promising alternative.
Syntax-based Machine Translation
Our second evaluation is syntax-based machine translation, where the input is a source-language dependency tree and the output is a plain target-language string. We employ the same data and settings as Bastings et al. (2017), using both the English-German and the English-Czech datasets from the WMT16 translation task (http://www.statmt.org/wmt16/translation-task.html). The English sentences are tokenized and then parsed into dependency trees on the source side using SyntaxNet (https://github.com/tensorflow/models/tree/master/syntaxnet). On the Czech and German sides, texts are tokenized using the Moses tokenizer (https://github.com/moses-smt/mosesdecoder). Byte-pair encodings with 8,000 merge operations are used to obtain subwords. The second block of Table 1 shows the statistics for both datasets. For model configuration, we simply re-use the settings from our AMR-to-text experiments.
Table 4 presents the results in comparison with existing methods. On the English-to-German translation task, our model achieves a BLEU score of 41.0, outperforming all previously published single models by a large margin of 2.3 BLEU points. On the English-to-Czech translation task, our model also outperforms the best previously reported single model by an impressive margin of 2 BLEU points. In fact, our single model already outperforms previous state-of-the-art models that use ensembling. The advantages of our method are also verified by the chrF++ metric.
An important point about these experiments is that we did not tune the architecture: we simply employed the same model in all experiments, only adjusting the batch size for different dataset sizes. We speculate that even better results would be obtained by tuning the architecture for individual tasks. Nevertheless, we still obtained improved performance over previous works, underlining the generality of our model.
The overall scores show a great advantage of the Graph Transformer over existing methods, including the state-of-the-art GNN-based models. However, they do not shed light on how this is achieved. To further reveal the source of the performance gain, we perform a series of analyses based on different characteristics of the graphs. For these analyses, we use sentence-level chrF++ scores and take their macro average where needed. All experiments are conducted on the test set of LDC2017T10.
To assess the model's performance on different sizes of graphs, we group graphs into four classes and show the curves of chrF++ scores in Figure 3, contrasted with the state-of-the-art GNN-based model of Guo et al. (2019), denoted Guo'19. As seen, the performance of both models decreases as graph size increases. This is expected, since a larger graph often contains more complex structure, and the interactions between graph elements are more difficult to capture. The gap between our model and Guo'19 widens for relatively larger graphs, while for small graphs both models give similar performance. This result demonstrates that our model is better at dealing with complicated graphs. For extremely large graphs, the performance of both models drops clearly, yet ours is still slightly better.
We then study the impact of graph diameter. (The diameter of a graph is defined as the length of the longest shortest path between any two nodes.) Graphs with large diameters contain interactions between nodes that are distant from each other. We conjecture that this causes severe difficulties for GNN-based models, because they rely solely on local communication. Figure 3 confirms our hypothesis: the curve of the GNN-based model shows a clear downward slope. In contrast, our model has more stable performance, and the gap between the two curves further illustrates the superiority of our model in capturing long-distance dependencies.
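The diameter used for this analysis can be computed by running BFS from every node; a small sketch over an undirected adjacency list (our own helper, assuming a connected graph):

```python
from collections import deque

def diameter(adj):
    """adj: dict node -> list of neighbors (undirected, connected graph).
    Returns the length of the longest shortest path between any two nodes."""
    def bfs_depths(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    return max(max(bfs_depths(n).values()) for n in adj)

# a path graph a-b-c-d has diameter 3
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
assert diameter(adj) == 3
```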
Number of Reentrancies
We study the ability to handle reentrancies, where the same node has multiple parent nodes (i.e., the same concept participates in multiple relations, for AMR). Recent work has identified reentrancies as one of the most difficult aspects of AMR structure. We bin the number of reentrancies occurring in a graph into four classes and plot the results in Fig. 3. The gap between the GNN-based model and the Graph Transformer becomes noticeably wide once graphs contain more than one reentrancy; beyond that point, our model is consistently better than the GNN-based model, maintaining a clear chrF++ margin.
How Far Does Attention Look?
The Graph Transformer shows a strong capacity for processing complex and large graphs. We attribute this success to the global communication design, as it provides opportunities for direct communication over long distances. A natural question is how well the model makes use of this property. To answer it, following Voita et al. (2019), we study the attention distribution of each attention head. Specifically, for each head we record the graph distance of the node to which its maximum attention weight is assigned. Fig. 4 shows the averaged attention distances on the development set of LDC2017T10. We observe that nearly half of the attention heads have a large average attention distance, and the number of such long-distance heads generally increases in deeper layers. Interestingly, the longest-reaching head (layer1-head5) and the shortest-sighted head (layer1-head2) coexist in the very first layer, the former with an average distance over 5.
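The head-level statistic — the distance to the maximally attended node, averaged over queries — can be computed as in the sketch below (`att` and `dist` are illustrative toy inputs, not real model outputs):

```python
import numpy as np

def avg_attention_distance(att, dist):
    """att: (n, n) attention weights of one head (rows = query nodes).
    dist: (n, n) pairwise graph distances.
    Returns the mean distance of each query's maximally attended node."""
    target = att.argmax(axis=1)  # where each query looks hardest
    return float(np.mean(dist[np.arange(len(att)), target]))

att = np.array([[0.1, 0.9],
                [0.8, 0.2]])
dist = np.array([[0, 3],
                 [3, 0]])
assert avg_attention_distance(att, dist) == 3.0
```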
In this paper, we presented the Graph Transformer, the first graph-to-sequence learning model based entirely on self-attention. Unlike previous recurrent models, which require linearization of the input graph, and previous graph neural network models, which restrict message passing to first-order neighborhoods, our model enables global node-to-node communication. With the Graph Transformer, we achieve new state-of-the-art results on two typical graph-to-sequence generation tasks across four benchmark datasets.
- Alberti, C., et al. (2017). SyntaxNet models for the CoNLL 2017 shared task. arXiv preprint arXiv:1703.04929.
- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.
- Banarescu, L., et al. (2013). Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 178–186.
- Bastings, J., et al. (2017). Graph convolutional encoders for syntax-aware neural machine translation. In EMNLP, pp. 1957–1967.
- Beck, D., Haffari, G., and Cohn, T. (2018). Graph-to-sequence learning using gated graph neural networks. In ACL, pp. 273–283.
- Cao, K. and Clark, S. (2019). Factorising AMR generation through syntax. In NAACL, pp. 2157–2163.
- Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Dai, Z., et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. In ACL, pp. 2978–2988.
- Damonte, M. and Cohen, S. B. (2019). Structural neural encoders for AMR-to-text generation. In NAACL, pp. 3649–3658.
- Denkowski, M. and Lavie, A. (2014). Meteor Universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
- Flanigan, J., et al. (2016). Generation from Abstract Meaning Representation using tree transducers. In NAACL, pp. 731–739.
- Gu, J., et al. (2016). Incorporating copying mechanism in sequence-to-sequence learning. In ACL, pp. 1631–1640.
- Guo, Z., et al. (2019). Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics, 7, pp. 297–312.
- Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.
- Koncel-Kedziorski, R., et al. (2019). Text generation from knowledge graphs with graph transformers. In NAACL, pp. 2284–2293.
- Konstas, I., et al. (2017). Neural AMR: sequence-to-sequence models for parsing and generation. In ACL, pp. 146–157.
- Li, Y., et al. (2016). Gated graph sequence neural networks. In ICLR.
- Liu, F., et al. (2015). Toward abstractive summarization using semantic representations. In NAACL, pp. 1077–1086.
- Papineni, K., et al. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- Popović, M. (2017). chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pp. 612–618.
- Pourdamghani, N., et al. (2016). Generating English from Abstract Meaning Representations. In INLG, pp. 21–25.
- See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: summarization with pointer-generator networks. In ACL, pp. 1073–1083.
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725.
- Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. In NAACL, pp. 464–468.
- Song, L., et al. (2017). AMR-to-text generation with synchronous node replacement grammar. In ACL, pp. 7–13.
- Song, L., et al. (2016). AMR-to-text generation as a traveling salesman problem. In EMNLP, pp. 2084–2089.
- Song, L., et al. (2018). A graph-to-sequence model for AMR-to-text generation. In ACL, pp. 1616–1626.
- Srivastava, N., et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp. 1929–1958.
- Vaswani, A., et al. (2017). Attention is all you need. In NIPS, pp. 5998–6008.
- Xu, K., et al. (2018). Graph2Seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823.