Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs

01/29/2020 · Leonardo F. R. Ribeiro et al. · Technische Universität Darmstadt

Recent graph-to-text models generate text from graph-based data using either global or local aggregation to learn node representations. Global node encoding allows explicit communication between two distant nodes, but neglects the graph topology, since all nodes are treated as directly connected. In contrast, local node encoding considers the relations between directly connected nodes, capturing the graph structure, but it can fail to capture long-range relations. In this work, we gather the best of both encoding strategies, proposing novel models that encode an input graph combining both global and local node contexts. Our approaches learn better contextualized node embeddings for text generation. In our experiments, we demonstrate that our models lead to significant improvements in KG-to-text generation, achieving BLEU scores of 17.81 on the AGENDA dataset and 63.10 on the WebNLG dataset for seen categories, outperforming the state of the art by 3.51 and 2.51 points, respectively.


1 Introduction

Figure 1: A graphical representation (a) of a scientific text (b). (c) A global encoder directly captures longer dependencies between any pair of nodes (blue and red arrows), but fails to capture the graph structure. (d) A local encoder explicitly accesses information from adjacent nodes (blue arrows) and implicitly captures distant information (dashed red arrows).

Graph-to-text generation refers to the task of generating natural language text from input graph structures, which can be semantic representations konsas_17, sub-graphs of knowledge graphs (KG) koncel-kedziorski-etal-2019-text or other forms of structured data konstas-lapata-2013-inducing. While many recent works song-etal-acl2018; damonte_naacl18; ribeiro-etal-2019-enhancing; dcgcnforgraph2seq19guo focus on generating sentence-level outputs, a more challenging and interesting scenario emerges when the goal is to generate longer, multi-sentence texts, such as a paragraph or document. In this context, the input graphs are much more diverse, representing knowledge from different domains and in different ways. The task is thus more demanding, since it can be necessary to select relevant parts of the graph for generating a concise text, and to handle document planning issues such as order, coherence and discourse markers gardent-etal-2017-webnlg.

A key issue in neural graph-to-text generation is how to encode graphs. The basic idea is to incrementally calculate node representations by aggregating context information. To this end, two main approaches have been proposed: (i) models based on local node aggregation, usually based on Graph Neural Networks (GNN) ribeiro-etal-2019-enhancing; dcgcnforgraph2seq19guo, and (ii) models that leverage global node aggregation. Systems based on the global encoding strategy are typically based on Transformer architectures zhu-etal-2019-modeling; cai-lam-2020-graph, using self-attention to compute a node representation based on all nodes in the graph. This approach enjoys the advantage of a large context range, but neglects the graph topology by effectively treating every node as connected to all the others in the graph. In contrast, models based on local aggregation learn the representation of each node based on its adjacent nodes as defined in the input. This method effectively exploits the graph structure; however, encoding relations between distant nodes requires more graph encoding layers, which can also propagate noise li2018deeper.

For example, Figure 1a presents a KG, for which a corresponding text is shown in Figure 1b. The nodes GNN and DistMulti have relations with the nodes node embeddings and link prediction, respectively. Both relations are important for GNN and DistMulti during the text generation phase, but lie in different connected components. As shown in Figure 1c, a global encoder can learn a node representation for DistMulti that captures information from indirectly connected entities such as node embeddings. Encoding such dependencies is important for KG verbalisation, as KGs are known to be highly incomplete, often missing links between entities Schlichtkrull2018ModelingRD. In addition, global encoding can capture long-range complex dependencies between entities, supporting document planning. In contrast, the local strategy refines the node representation with richer neighborhood information, as nodes that share the same neighborhood exhibit strong homophily: two entities belonging to the same topic in a KG are much more likely to be connected than at random. Consequently, the local context enriches the node representation with topic-related information from KG triples. For example, in Figure 1a, GAT reaches node embeddings through GNN. This transitive relation can be captured by a local encoder, as shown in Figure 1d. Capturing this form of relationship can also support text generation at the sentence level.

In this paper, we investigate novel graph-to-text architectures that combine both global and local node aggregations, gathering the benefits of both strategies. In particular, we propose a unified graph-to-text framework based on Graph Attention Networks (GAT, velickovic2018graph). As part of this framework, we empirically compare two main architectures: a cascaded architecture that performs global node aggregation before local node aggregation, and a parallel architecture that performs global and local aggregation simultaneously, before concatenating the representations. While the cascaded architecture allows the local encoder to leverage global encoding features, the parallel architecture allows more independent features to complement each other. To further consider fine-grained integration, we additionally consider layer-wise integration of the global and local encoders.

Extensive experiments show that our approaches consistently outperform recent models on two benchmarks for text generation from KGs, giving the best reported results so far. Compared with parallel structures, cascaded structures give better performance with fewer parameters. To the best of our knowledge, we are the first to consider integrating global and local context aggregation in graph-to-text generation, and the first to propose a unified GAT structure for integrating global and local aggregation.

2 Related Work

Early efforts for graph-to-text generation employ statistical methods flanigan-etal-2016-generation; pourdamghani-etal-2016-generating; song-etal-2017-amr. Recently, several neural graph-to-text models have exhibited success by leveraging encoder mechanisms based on GNN and Transformer architectures, learning effective latent graph representations.

AMR-to-Text Generation

Recent neural models have been applied to sentence-level generation from Abstract Meaning Representation (AMR) graphs. konsas_17 provide the first neural approach for this task, by linearising the input graph as a sequence of nodes and edges. song-etal-acl2018 propose the graph recurrent network (GRN) to directly encode the AMR nodes, whereas beck-etal-2018-acl2018 develop a model based on GGNNs Li2016GatedGS. However, both approaches only employ local node aggregation strategies. damonte_naacl18 and ribeiro-etal-2019-enhancing develop models employing GNNs and LSTMs, in order to learn complementary node contexts. Recent methods zhu-etal-2019-modeling; cai-lam-2020-graph employ Transformers to learn globalized node representations, modeling graph paths in order to capture structural relations between nodes. We go a step further in this direction, combining global node representations with local neighborhood node representations learned by GNN models.

KG-to-Text Generation

Recent research efforts aim to generate fluent text from KG triples, often requiring multiple sentences. The WebNLG challenge gardent-etal-2017-webnlg consists of generating meaningful text from DBPedia graphs. In this challenge, neural encoder-decoder systems, such as ADAPT, present strong results by encoding linearized triple sets. In a recent approach, moryossef-etal-2019-step separate the generation into planning and realization stages, showing that high-quality inputs enhance the text generation process. trisedya-etal-2018-gtr develop an LSTM-based encoder that captures relationships within a triple and among triples. castro-ferreira-etal-2019-neural introduce a systematic comparison between pipeline and neural end-to-end approaches for text generation from RDF graphs. Nevertheless, those approaches treat the triples as separate structures, without explicitly considering the graph topology. To explicitly encode the graph structure, marcheggiani-icnl18 propose an encoder based on graph convolutional networks (GCN) and show superior performance compared to LSTMs. Our work is related to koncel-kedziorski-etal-2019-text, who propose a transformer-based approach that only focuses on the relations between directly connected nodes. However, our models focus on both global and local node relations, capturing complementary graph contexts.

3 Graph-to-Text Model

In this section, we describe (i) the general concept of GNNs; (ii) the proposed local and global graph encoders; (iii) the graph transformation adopted to create a relational graph from the input; and (iv) the various combined global and local graph architectures.

3.1 Graph Neural Networks (GNN)

Formally, let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a multi-relational graph (in this paper, multi-relational graphs refer to directed graphs with labelled edges) with nodes $v \in \mathcal{V}$ and labelled edges $(u, r, v) \in \mathcal{E}$, where $r \in \mathcal{R}$ represents the relation between $u$ and $v$. GNNs work by iteratively learning a representation vector $h_v$ of a node $v$ based on both its context node neighbors and edge features, through an information propagation scheme. More formally, the $l$-th layer aggregates the representations of $v$'s context nodes:

$$h^{(l)}_{\mathcal{N}(v)} = \mathrm{AGGR}^{(l)}\big(\{ (h^{(l-1)}_u, r_{uv}) : u \in \mathcal{N}(v) \}\big),$$

where $\mathrm{AGGR}^{(l)}(\cdot)$ is an aggregation function shared by all nodes on the $l$-th layer, $r_{uv}$ represents the relation between $u$ and $v$, and $\mathcal{N}(v)$ is a set of context nodes for $v$. In most GNNs, the context nodes are those adjacent to $v$.

The aggregated context representation is used to update the representation of $v$:

$$h^{(l)}_v = \mathrm{COMBINE}^{(l)}\big(h^{(l-1)}_v,\, h^{(l)}_{\mathcal{N}(v)}\big).$$

After $L$ iterations, a node's representation encodes the structural information within its $L$-hop neighborhood. The choices of $\mathrm{AGGR}^{(l)}(\cdot)$ and $\mathrm{COMBINE}^{(l)}(\cdot)$ differ across specific GNN models. An example of $\mathrm{AGGR}^{(l)}(\cdot)$ is the sum of the representations of $\mathcal{N}(v)$; an example of $\mathrm{COMBINE}^{(l)}(\cdot)$ is a concatenation followed by a feature transformation.
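To make this abstraction concrete, the following is a minimal, illustrative PyTorch sketch (not the paper's implementation) of one propagation layer in which AGGR is a sum over adjacent nodes and COMBINE is a linear transformation of the concatenated representations; the graph, class name and dimensions are toy values.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One generic message-passing layer: AGGR = sum over adjacent nodes,
    COMBINE = linear projection of the concatenation [h_v ; aggregated context]."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, h, edges):
        # h: (num_nodes, dim) node states; edges: list of (u, v) pairs, u -> v
        src = torch.tensor([u for u, _ in edges])
        dst = torch.tensor([v for _, v in edges])
        agg = torch.zeros_like(h).index_add(0, dst, h[src])   # AGGR: sum neighbours into v
        return torch.relu(self.combine(torch.cat([h, agg], dim=-1)))  # COMBINE

# Toy usage: 4 nodes, 3 directed edges; stacking L such layers gives each node
# a view of its L-hop neighbourhood.
h = torch.randn(4, 16)
layer = SimpleGNNLayer(16)
h_next = layer(h, [(0, 1), (1, 2), (2, 3)])
```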

3.2 Global Graph Encoder

A global graph encoder aggregates a global context for updating each node, by treating the graph as fully connected (see Figure 1c). We use the attention mechanism as the message-passing scheme, extending the self-attention network structure of the Transformer NIPS2017_7181 to a GAT structure. In particular, we compute a layer of the global convolution for a node $v \in \mathcal{V}$, which takes the node feature representations $h_u$ as input, adopting $\mathrm{AGGR}^{(l)}(\cdot)$ as:

$$\hat{h}_v = \sum_{u \in \mathcal{V}} \alpha_{vu}\, W_g\, h_u, \tag{1}$$

where $W_g$ is a model parameter. The attention weight $\alpha_{vu}$ is calculated as:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{V}} \exp(e_{vk})}, \tag{2}$$

where

$$e_{vu} = \frac{(W_q h_v)^{\top} (W_k h_u)}{\sqrt{d_z}} \tag{3}$$

is the attention function which measures the global importance of node $u$'s features to node $v$. $W_q$ and $W_k$ are model parameters and $\sqrt{d_z}$ is a scaling factor.

Multi-head Attention.

To capture distinct relations between nodes, $N$ different global convolutions are calculated and concatenated:

$$\hat{h}_v = \Big\Vert_{n=1}^{N} \sum_{u \in \mathcal{V}} \alpha^{n}_{vu}\, W^{n}_g\, h_u. \tag{4}$$

Finally, we define $\mathrm{COMBINE}^{(l)}(\cdot)$ employing layer normalization (LayerNorm) and a fully connected feed-forward network (FFN), in a similar way as the Transformer architecture:

$$\tilde{h}_v = \mathrm{LayerNorm}\big(h^{(l-1)}_v + \hat{h}_v\big), \qquad h^{(l)}_v = \mathrm{LayerNorm}\big(\tilde{h}_v + \mathrm{FFN}(\tilde{h}_v)\big). \tag{5}$$

This strategy creates an artificial complete graph with $\mathcal{O}(|\mathcal{V}|^2)$ edges. Note that the global encoder does not consider the edge relations between nodes. In particular, if the labelled edges were considered, the self-attention space complexity would grow with the number of relation types, to $\mathcal{O}(|\mathcal{R}|\,|\mathcal{V}|^2)$.
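The following is a simplified, single-head sketch of such a global layer, assuming the standard Transformer-style residual/LayerNorm/FFN combination described above; the parameter names (w_q, w_k, w_g) and dimensions are illustrative, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class GlobalAttentionLayer(nn.Module):
    """Single-head sketch of the global encoder: scaled dot-product attention
    over *all* nodes (Eqs. 1-3), followed by a Transformer-style COMBINE with
    LayerNorm and a position-wise FFN (Eq. 5)."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_g = nn.Linear(dim, dim, bias=False)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h):
        # h: (num_nodes, dim); every node attends to every other node
        scores = self.w_q(h) @ self.w_k(h).t() / math.sqrt(h.size(-1))  # e_vu
        alpha = torch.softmax(scores, dim=-1)                            # Eq. 2
        h_hat = alpha @ self.w_g(h)                                      # Eq. 1
        h = self.norm1(h + h_hat)            # residual + LayerNorm
        return self.norm2(h + self.ffn(h))   # FFN block (Eq. 5)

h = torch.randn(5, 64)
out = GlobalAttentionLayer(64)(h)   # (5, 64) globally contextualised nodes
```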

3.3 Local Graph Encoder

The representation $\hat{h}_v$ captures macro relationships from $v$ to all other nodes in the graph. However, this representation lacks both structural information regarding the local neighborhood of $v$ and the graph topology, and it does not capture typed relations between nodes (see Equations 1 and 3).

Figure 2: Overview of the proposed encoder architectures. (a) and (b): fully separated parallel and cascaded global and local node encoders. (c) Global and local node representations are concatenated layer-wise. (d) Both node representations are cascaded layer-wise.

In order to capture this crucial graph information and impose a strong relational inductive bias, we build a local graph encoder by employing a modified version of GAT augmented with relational weights. In particular, we compute a layer of the local convolution for a node $v \in \mathcal{V}$, adopting $\mathrm{AGGR}^{(l)}(\cdot)$ as:

$$\hat{h}_v = \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, W_r\, h_u, \tag{6}$$

where $W_r$ encodes the relation $r$ between $u$ and $v$, and $\mathcal{N}(v)$ is a set of nodes adjacent to $v$. The attention coefficient $\alpha_{vu}$ is computed as:

$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})}, \tag{7}$$

where

$$e_{vu} = \sigma\big(a^{\top} [\, W_r h_v \,\Vert\, W_r h_u \,]\big) \tag{8}$$

is the attention function which calculates the relative importance of adjacent nodes, considering typed relations. $\sigma$ is an activation function, $\Vert$ denotes concatenation and $a$ is a model parameter.

We employ multi-head attention to learn local relations from different perspectives, as in Equation 4, generating $\hat{h}_v$. Finally, we define $\mathrm{COMBINE}^{(l)}(\cdot)$ as:

$$h^{(l)}_v = \mathrm{GRU}\big(h^{(l-1)}_v,\, \hat{h}_v\big), \tag{9}$$

where we employ as the RNN a Gated Recurrent Unit (GRU) cho-etal-2014-learning. The GRU facilitates information propagation between local layers. This choice is motivated by recent work Xu2018RepresentationLO; NIPS2019_9675 that theoretically demonstrates that sharing information between layers helps structural signals propagate. In a similar direction, AMR-to-text generation models employ LSTMs song-etal-2017-amr and dense connections dcgcnforgraph2seq19guo between GNN layers.
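A minimal single-head sketch of such a relational local layer is shown below, assuming one weight matrix per relation type and a GRU cell as the COMBINE function; the edge format, class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRelationalAttentionLayer(nn.Module):
    """Single-head sketch of the local encoder: attention over adjacent nodes
    with relation-specific weights W_r, combined with the previous node state
    through a GRU cell (Eqs. 6-9)."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.w_rel = nn.ModuleList([nn.Linear(dim, dim, bias=False)
                                    for _ in range(num_relations)])
        self.attn = nn.Linear(2 * dim, 1, bias=False)   # plays the role of the vector a
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: list of (u, relation_id, v), v attends to u
        h_hat = torch.zeros_like(h)
        for v in range(h.size(0)):
            nbrs = [(u, r) for (u, r, t) in edges if t == v]
            if not nbrs:
                continue
            msgs = torch.stack([self.w_rel[r](h[u]) for u, r in nbrs])   # W_r h_u
            e = F.leaky_relu(self.attn(torch.cat(
                [h[v].expand_as(msgs), msgs], dim=-1)))                  # attention scores
            alpha = torch.softmax(e, dim=0)                              # over neighbours
            h_hat[v] = (alpha * msgs).sum(dim=0)                         # aggregated context
        return self.gru(h_hat, h)                                        # GRU as COMBINE

h = torch.randn(4, 32)
edges = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]   # (source node, relation id, target node)
out = LocalRelationalAttentionLayer(32, num_relations=2)(h, edges)
```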

3.4 Graph Preparation

We represent a KG as a multi-relational graph $\mathcal{G}_e = (\mathcal{V}_e, \mathcal{E}_e)$ with entity nodes $e \in \mathcal{V}_e$ and labeled edges $(e_i, r, e_j) \in \mathcal{E}_e$, where $r \in \mathcal{R}$ denotes the relation existing from the entity $e_i$ to the entity $e_j$. Note that $\mathcal{R}$ contains relations both in the canonical direction (e.g. used-for) and in the inverse direction (e.g. used-for-inv).

Unlike other current approaches koncel-kedziorski-etal-2019-text; moryossef-etal-2019-step, we represent an entity as a set of nodes. Formally, we transform each $\mathcal{G}_e$ into a new graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each token of an entity $e \in \mathcal{V}_e$ becomes a node $v \in \mathcal{V}$. We convert each edge $(e_i, r, e_j) \in \mathcal{E}_e$ into a set of edges (with the same relation $r$) and connect every token of $e_i$ to every token of $e_j$. That is, an edge $(u, r, v)$ belongs to $\mathcal{E}$ if and only if there exists an edge $(e_i, r, e_j) \in \mathcal{E}_e$ such that $u \in e_i$ and $v \in e_j$. We represent each node $v \in \mathcal{V}$ with an embedding $h^{0}_v$, generated from its corresponding token.

Dataset | train | dev | test | relations | avg. entities | avg. nodes | avg. edges | avg. length
AGENDA | 38,720 | 1,000 | 1,000 | 7 | 12.4 | 44.3 | 68.6 | 140.3
WebNLG | 18,102 | 871 | 971 | 373 | 4.0 | 34.9 | 101.0 | 24.2
Table 1: Data statistics. Node and edge counts are calculated after the graph transformation. Averages are computed per instance.

The new graph increases the representational power of the model because it allows learning node embeddings at the token level, instead of the entity level. This is particularly important for text generation as it permits the model to be more flexible, capturing richer relationships between entity tokens. It also allows the model to learn relations and attention functions between source and target tokens. However, it has the side effect of removing the natural sequential order of multi-word expressions such as entities. To preserve this information, we employ position embeddings NIPS2017_7181, i.e., $h^{0}_v$ becomes the sum of the corresponding token embedding and the positional embedding for $v$.
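The transformation can be sketched as follows in plain Python; the entity surface forms, relation names and the inverse-edge convention are illustrative stand-ins for the actual preprocessing.

```python
# Sketch of the graph transformation of Section 3.4 (names are illustrative):
# every entity token becomes a node, every entity-level edge is expanded to
# all token pairs, and each node keeps its position inside its entity so a
# positional embedding can be added to its token embedding.

def transform_kg(entities, triples):
    """entities: {entity_id: "multi word surface form"}
    triples: [(subject_id, relation, object_id), ...]"""
    nodes, positions = [], []          # token nodes and their in-entity position
    token_ids = {}                     # entity_id -> list of node indices
    for ent_id, surface in entities.items():
        token_ids[ent_id] = []
        for pos, tok in enumerate(surface.split()):
            token_ids[ent_id].append(len(nodes))
            nodes.append(tok)
            positions.append(pos)
    edges = []
    for subj, rel, obj in triples:     # connect every subject token to every object token
        for u in token_ids[subj]:
            for v in token_ids[obj]:
                edges.append((u, rel, v))
                edges.append((v, rel + "-inv", u))   # inverse-direction relation
    return nodes, positions, edges

nodes, positions, edges = transform_kg(
    {"e1": "graph attention networks", "e2": "node embeddings"},
    [("e1", "used-for", "e2")])
# nodes[i] would be embedded as token_embedding(nodes[i]) + position_embedding(positions[i])
```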

3.5 Combining Global and Local Encodings

Our goal is to implement a graph encoder capable of encoding global and local aspects of the input graph. We hypothesize that the two sources of information are complementary and a combination of both enriches node representations for text generation. In order to test this hypothesis, we investigate four possible combination architectures. Figure 2 presents our proposed encoders.

Parallel Graph Encoding.

In this setup, we compose the global and local graph encoders in a fully parallel structure (Figure 2a). Note that each graph encoder can have different numbers of layers and attention heads. $h^{0}_v$ is the initial input for the first layer of both encoders. The final node representation $h_v$ is the concatenation of the local and global node representations: $h_v = \big[\, h^{\mathrm{glob}}_v \,\Vert\, h^{\mathrm{loc}}_v \,\big]$.

Cascaded Graph Encoding.

We cascade the global and local graph encoders as shown in Figure 2b, by first computing a global-contextual node embedding and then refining it with the local context. $h^{0}_v$ is the initial input for the global encoder, and the global encoder's output $h^{\mathrm{glob}}_v$ is the initial input for the local encoder.

Layer-wise Parallel and Cascaded Graph Encoding.

To allow fine-grained interaction between the two types of contextual information, we also combine the encoders in a layer-wise fashion. In particular, for each graph layer, we employ both the local and global encoders in a parallel structure, as shown in Figure 2c. We also experiment with cascading the graph encoders layer-wise (Figure 2d).
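As a sketch, the fully parallel and fully cascaded combinations might be wired as below, reusing any global and local encoder modules (for example the layer sketches above); the optional output projection is an assumption for dimensionality, not a detail taken from the paper.

```python
import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Parallel combination (Fig. 2a): both encoders read h^0 and their
    outputs are concatenated."""
    def __init__(self, global_enc, local_enc, dim):
        super().__init__()
        self.global_enc, self.local_enc = global_enc, local_enc
        self.proj = nn.Linear(2 * dim, dim)   # optional projection of [global ; local]

    def forward(self, h0, edges):
        h_glob = self.global_enc(h0)
        h_loc = self.local_enc(h0, edges)
        return self.proj(torch.cat([h_glob, h_loc], dim=-1))

class CascadedEncoder(nn.Module):
    """Cascaded combination (Fig. 2b): the local encoder refines the
    globally contextualised representations."""
    def __init__(self, global_enc, local_enc):
        super().__init__()
        self.global_enc, self.local_enc = global_enc, local_enc

    def forward(self, h0, edges):
        return self.local_enc(self.global_enc(h0), edges)

# e.g. CascadedEncoder(GlobalAttentionLayer(64),
#                      LocalRelationalAttentionLayer(64, num_relations=2))
```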

3.6 Decoder

Our decoder follows the same core architecture as the Transformer decoder. Each time step is updated by interleaving multiple rounds of multi-head attention over the output of the encoder (node embeddings $h_v$) and attention over previously generated tokens (token embeddings). An additional challenge in our setup is to generate multi-sentence outputs. In order to encourage the model to generate longer texts, we employ a length penalty DBLP:journals/corr/WuSCLNMKCGMKSJL16 to refine the pure max-probability beam search.
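A sketch of this rescoring step, assuming the GNMT-style penalty of Wu et al. (2016) with an illustrative alpha and made-up beam hypotheses, is:

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty (Wu et al., 2016): lp(Y) = ((5 + |Y|) / 6)^alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(hypotheses, alpha=0.6):
    """hypotheses: list of (tokens, sum_log_prob) pairs from beam search."""
    return sorted(hypotheses,
                  key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
                  reverse=True)

# With the penalty, the longer hypothesis wins here; under pure
# max-probability search, the shorter one (higher raw log-probability) would.
beams = [(["the", "model", "uses", "graph", "contexts"], -3.2),
         (["the", "model"], -2.9)]
print(rescore(beams)[0][0])
```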

4 Data and Preprocessing

We assess the effectiveness of our models on two datasets: AGENDA koncel-kedziorski-etal-2019-text and WebNLG gardent-etal-2017-webnlg. Table 1 shows the statistics for both datasets.

Model | #L | #H | BLEU | METEOR | CHRF++ | #P
koncel-kedziorski-etal-2019-text | 6 | 8 | 14.30 ± 1.01 | 18.80 ± 0.28 | - | -
Baseline | 6 | 8 | 14.11 ± 0.28 | 19.35 ± 0.52 | 41.95 ± 0.39 | 54.4
PGE-LW | 6 | 8, 4 | 17.40 ± 0.08 | 22.06 ± 0.09 | 46.19 ± 0.16 | 67.7
CGE-LW | 6 | 8, 8 | 17.44 ± 0.10 | 22.02 ± 0.13 | 46.24 ± 0.14 | 76.4
PGE | 6, 3 | 8, 8 | 17.17 ± 0.38 | 21.70 ± 0.25 | 45.75 ± 0.43 | 67.4
CGE | 6, 3 | 8, 8 | 17.81 ± 0.15 | 21.75 ± 0.55 | 46.76 ± 0.12 | 66.9
Table 2: Results on the AGENDA test set (mean ± standard deviation over training runs). #L and #H are the numbers of layers and attention heads in each layer, respectively; when more than one value is given, they refer to the global and local encoders, respectively. #P is the number of parameters in millions (node embeddings included).

Agenda.

In this dataset, KGs are paired with scientific abstracts extracted from the proceedings of 12 top AI conferences. Each instance consists of the paper title, a KG, and the paper abstract. Entities correspond to scientific terms, which are often multi-word expressions (co-referential entities are merged). We treat each token in the title as a node, creating a single graph with title and KG tokens as nodes. As shown in Table 1, the average output length is considerably large, as the target outputs are multi-sentence abstracts.

WebNLG.

In this dataset, each instance contains a graph extracted from DBPedia. The target text consists of one or more sentences that verbalise the graph. We evaluate the models on the test set with seen categories. Note that this dataset has a considerable number of edge relations (see Table 1). In order to avoid a parameter explosion, we use regularization based on basis function decomposition to define the model relation weights Schlichtkrull2018ModelingRD. As an alternative, we also employ the Levi transformation to create nodes from relational edges between entities beck-etal-2018-acl2018. That is, we create a new relation node for each edge relation between two nodes. The new relation node is connected to the subject and object entity tokens by two binary relations, respectively.
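Both tricks can be sketched as follows: a basis decomposition of the relation weights (after Schlichtkrull et al., 2018) and a Levi-style transformation that turns labelled edges into relation nodes. The dimensions, number of bases, node-naming scheme and example triple are illustrative.

```python
import torch
import torch.nn as nn

class BasisRelationWeights(nn.Module):
    """Basis-function decomposition: each relation matrix W_r is a learned
    mixture of B shared bases, so parameters grow with B rather than |R|."""
    def __init__(self, num_relations, dim, num_bases):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim, dim) * 0.02)
        self.coeffs = nn.Parameter(torch.randn(num_relations, num_bases) * 0.02)

    def forward(self, rel_ids):
        # rel_ids: (E,) relation index per edge -> (E, dim, dim) weight matrices
        w_all = torch.einsum('rb,bij->rij', self.coeffs, self.bases)
        return w_all[rel_ids]

def levi_transform(triples):
    """Turn each labelled edge into a relation node linked to its subject and
    object by two unlabelled (binary) edges, as in the CGE-LG variant."""
    edges = []
    for idx, (subj, rel, obj) in enumerate(triples):
        rel_node = f"{rel}#{idx}"          # a fresh relation node per triple
        edges.append((subj, rel_node))
        edges.append((rel_node, obj))
    return edges

w_r = BasisRelationWeights(num_relations=373, dim=64, num_bases=10)(torch.tensor([5, 12, 5]))
print(levi_transform([("Alan_Bean", "occupation", "Test_pilot")]))
```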

5 Experiments and Discussion

The models are trained for 30 epochs with early stopping based on the development BLEU score. We use Adam optimization with an initial learning rate of 0.5. The vocabulary is shared between the node and target tokens. In order to mitigate the effects of random seeds, we report, for the test sets, the averages of 4 training runs along with their standard deviations. Hyperparameters are tuned on the development sets of both datasets. Following previous work castro-ferreira-etal-2019-neural, we employ byte pair encoding (BPE) to split entity words into smaller, more frequent pieces, so some nodes in the graph can be subwords; we also apply subword segmentation on the target side. We call our models PGE-LW (layer-wise parallel encoder), CGE-LW (layer-wise cascaded encoder), PGE (fully parallel encoder) and CGE (fully cascaded encoder). We use a standard Transformer as the baseline, with a linearized version of the KG triples as input. Following previous work, we evaluate the results in terms of BLEU Papineni:2002:BMA:1073083.1073135, METEOR Denkowski14meteoruniversal and sentence-level CHRF++ popovic-2015-chrf scores. To further assess the quality of the generated texts, we also perform a human evaluation.
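As an illustration of the BPE preprocessing mentioned above (not the authors' exact pipeline), a subword model could be trained and applied with, e.g., the sentencepiece library; the file names, vocabulary size and example string are placeholders.

```python
import sentencepiece as spm

# Train a small BPE model on a text file (one sentence per line); the corpus
# path and vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="entities_and_targets.txt", model_prefix="bpe",
    vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("convolutional neural networks", out_type=str))
# e.g. ['▁convolution', 'al', '▁neural', '▁networks'] -- entity tokens may become subwords
```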

5.1 Results on AGENDA

Table 2 presents the results. We report the number of layers and attention heads employed by the models. For a fair comparison, we use the same number of global layers and attention heads across the different models. Our approaches substantially outperform the Transformer baseline. CGE, our best model, outperforms koncel-kedziorski-etal-2019-text, a graph transformer model that only allows information exchange between adjacent nodes, by a large margin, achieving a BLEU score of 17.81, 24.5% higher. These results indicate that combining the local node context, which leverages the graph topology, with the global node context, which captures macro-level node relations, leads to better node embeddings for text generation. The models based on layer-wise encoding have similar results, with CGE-LW achieving the best METEOR score. PGE has the worst performance among the proposed models. Even though CGE has the smallest number of parameters, it achieves the best performance in terms of BLEU and CHRF++ scores.

Model | BLEU | METEOR | CHRF++ | #P
UPF-FORGe | 40.88 | 40.00 | - | -
Melbourne | 54.52 | 41.00 | 70.72 | -
Adapt | 60.59 | 44.00 | 76.01 | -
Marcheggiani and Perez (2018) | 55.90 | 39.00 | - | 4.9
Trisedya et al. (2018) | 58.60 | 40.60 | - | -
Castro et al. (2019) | 57.20 | 41.00 | - | -
CGE-RP | 62.30 ± 0.27 | 43.51 ± 0.18 | 75.49 ± 0.34 | 13.9
CGE-LG | 63.10 ± 0.13 | 44.11 ± 0.09 | 76.33 ± 0.10 | 12.8
Table 3: Results on the WebNLG test set with seen categories.

5.2 Results on WebNLG

We compare the performance of our best model (CGE) with six state-of-the-art results of graph-to-text models reported for this dataset gardent-etal-2017-webnlg; trisedya-etal-2018-gtr; marcheggiani-icnl18; castro-ferreira-etal-2019-neural. Three systems are the best competitors in the challenge for seen categories: UPF-FORGe, Melbourne and Adapt. UPF-FORGe follows a rule-based approach, whereas Melbourne and Adapt employ encoder-decoder models with linearized triple sets. Table 3 presents the results.

Relations as Parameters.

CGE-RP encodes relations as model parameters and achieves a BLEU score of 62.30, 8.9% better than the best model of castro-ferreira-etal-2019-neural, who employ an end-to-end architecture based on GRUs. CGE-RP also outperforms trisedya-etal-2018-gtr, an approach that encodes both intra-triple and inter-triple relationships, by 4.5 BLEU points. Interestingly, their intra-triple and inter-triple mechanisms capture relationships within a triple and among triples, which are closely related to our local and global encodings. However, they rely on encoding sequences of relations and entities obtained by graph traversal algorithms, whereas we explicitly exploit the graph structure through local neighborhood aggregation.

Relations as Nodes.

CGE-LG uses Levi graphs as inputs and achieves the best performance, even though it uses fewer parameters. One advantage of this approach is that it allows the model to handle new relations, as they are treated as nodes. Moreover, the relations become part of the shared vocabulary, making this information directly usable during the decoding process. We outperform an approach based on GNNs marcheggiani-icnl18 by a large margin of 7.2 BLEU points, showing that our graph encoding strategies lead to better text generation. We also outperform Adapt, a strong competitor that employs subword encodings, by 2.51 BLEU points.

Model | BLEU | CHRF++ | #P
CGE | 17.25 | 45.61 | 66.9
Global Encoder
- Global Attention | 15.48 | 43.89 | 63.8
- FFN | 16.48 | 44.85 | 54.3
- Global Encoder | 14.96 | 43.18 | 48.0
Local Encoder
- Graph Attention | 16.44 | 45.54 | 66.9
- Weight Relations | 16.77 | 45.51 | 56.7
- GRU | 16.19 | 44.54 | 65.4
- Local Encoder | 14.43 | 42.43 | 54.4
- Shared Vocab. | 15.52 | 44.02 | 86.7
Decoder
- Length Penalty | 16.50 | 44.61 | 66.9
Table 4: Ablation study of the modules used in the encoder and decoder of the CGE model (AGENDA dev set).

5.3 Ablation Study

In Table 4, we report an ablation study on the impact of each module used in the CGE model, on the development set of the AGENDA dataset.

Global Graph Encoder.

We start with an ablation of the global encoder. After removing the global attention coefficients, the performance of the model drops by 1.77 BLEU and 1.72 CHRF++ points. The results also show that using the FFN in the global COMBINE function is important to the model, but less so than the global attention. However, when we remove the FFN, the number of parameters drops considerably (around 19%), from 66.9 to 54.3 million. Finally, without the entire global encoder, the result drops substantially, by 2.29 BLEU points. This indicates that enriching node embeddings with a global context allows learning more expressive graph representations.

Figure 3: Comparison between different encoder architectures with respect to (a) graph diameter and (b) number of triples, on the AGENDA dev set.

Local Graph Encoder.

We first remove the local graph attention, and the BLEU score drops to 16.44, showing that the neighborhood attention improves performance. After removing the relation types, encoded as model weights, the performance drops by 0.48 BLEU points; however, the number of parameters is reduced by around 10 million. This indicates that we can obtain a more parameter-efficient model with only a slight drop in performance. Removing the GRU used in the COMBINE function reduces the performance considerably. The worst performance occurs if we remove the entire local encoder, with a BLEU score of 14.43, essentially making the encoder similar to the baseline.

Finally, we note that vocabulary sharing is critical for improving performance, and the length penalty is beneficial as we generate multi-sentence outputs.

Figure 4: CHRF++ scores for AGENDA dev set, with respect to (a) the number of nodes, and (b) the graph diameter. (c) Distribution of output length of the gold references and models’ output for the AGENDA dev set.

5.4 Comparing Encoding Strategies

The overall performance on both datasets suggests the superiority of combining global and local node representations. However, to have a better understanding of the positive and negative aspects of each proposed model, we introduce a systematic comparison between the encoding strategies.

Figure 5: Relation between the number of nodes and the length of the generated text, in number of words.

Figure 3a shows the impact of graph diameter on the four encoding methods. The models perform on par for graphs with smaller diameters. Models based on layer-wise aggregation (PGE-LW and CGE-LW) perform better when handling larger graph diameters, but their overall performance is worse compared to the fully separated models (PGE and CGE), because only 2% of the graphs in the AGENDA dev set have a diameter larger than or equal to 5. This indicates that the layer-wise encoders can better capture long-distance node dependencies. Moreover, the margin between PGE-LW and CGE-LW increases as the diameter increases, suggesting that PGE-LW can be a good option for encoding graphs with larger diameters.

#T | #DP | Melbourne | Adapt | CGE-LG
1 | 222 | 82.27 | 87.54 | 88.22
2 | 174 | 74.23 | 77.43 | 79.23
3 | 197 | 67.61 | 73.27 | 72.88
4 | 189 | 66.04 | 70.76 | 71.34
5 | 144 | 61.87 | 68.38 | 67.37
6 | 24 | 63.74 | 71.34 | 71.92
7 | 21 | 59.52 | 73.11 | 74.16

#D | #DP | Melbourne | Adapt | CGE-LG
1 | 222 | 82.27 | 87.54 | 88.22
2 | 469 | 69.94 | 74.54 | 75.13
≥3 | 280 | 62.87 | 69.30 | 69.12
Table 5: CHRF++ scores with respect to the number of input triples (#T) and graph diameter (#D) on the WebNLG dev set. #DP refers to the number of datapoints.

Figure 3b shows the models' performance with respect to the number of triples. CGE achieves better results when the number of triples is large. On the other hand, PGE performs relatively worse when handling more information, that is, KGs with more triples.

5.5 Impact of the Graph Structure and Output Length

We investigate the performance of our best model (CGE) concerning different data properties.

Number of Triples.

In Table 5, we inspect the effect of the number of input triples on the models' performance, measured using CHRF++ scores (CHRF++ is used as it is a sentence-level metric) on the WebNLG dev set. In general, our model obtains better scores over almost all partitions, showing that explicitly capturing structural information is beneficial for text generation. The performance decreases as the number of triples increases. However, when handling datapoints with more triples (7), Adapt and our model achieve higher performance. We hypothesize that this happens because the models receive a considerable amount of input data, giving more context to the text generation process, even though the graph structure is more complex.

Number of Nodes.

Figure 4a shows the effect of the graph size, measured by the number of nodes, on performance. Note that the score increases as the graph size increases. This trend is particularly interesting and contrasts with AMR-to-text generation, in which the models' performance generally decreases as the graph size increases cai-lam-2020-graph. In AMR benchmarks, the graph size is correlated with the sentence length, and longer sentences are more challenging to generate than shorter ones. On the other hand, AGENDA contains abstracts of similar lengths (as shown in Figure 4c, 83% of the gold abstracts have more than 100 words), and when the input is a bigger graph, the model has more information to leverage during generation. We also investigate the performance with respect to the number of local graph layers. The performance with 1 and 4 layers is similar, while the best performance, regardless of the number of nodes, is achieved with 3 layers.

Graph Diameter.

Figure 4b shows the impact of the graph diameter on performance when employing only the global or local encoding modules, or both, on the AGENDA dev set. Similarly to the graph size, the score increases as the diameter increases. As the global encoder is not aware of the graph structure, this module has the worst scores, even though it enables direct node communication over long distances. In contrast, the local encoder can propagate precise node information throughout the graph structure over $L$-hop distances, leading to better relative performance. We also observe that the performance gap between the global and local encoders increases when the diameter is 1. In this case, the graph has many connected components, that is, the triples do not share entities. This reveals that computing node representations based on adjacent nodes, rather than on the entire set of entities, leads to better performance. Table 5 shows the performance of our best model and others with respect to the graph diameter on the WebNLG dev set. In contrast to AGENDA, the score decreases as the diameter increases. This behavior highlights a crucial difference between the two datasets: whereas in WebNLG the graph size is correlated with the output size, this is not the case for AGENDA. For WebNLG, higher diameters pose additional challenges to the models, as they need to generate longer outputs.
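For reference, one plausible way to compute the diameter of these (possibly disconnected) graphs with networkx is sketched below; how the analysis actually bins graphs by diameter is not spelled out above, so treating disconnected graphs via a maximum over components is an assumption.

```python
import networkx as nx

def graph_diameter(edges):
    """Diameter of an undirected view of the graph; for disconnected graphs
    (common when triples do not share entities) take the maximum diameter
    over connected components."""
    g = nx.Graph()
    g.add_edges_from(edges)
    return max(nx.diameter(g.subgraph(c)) for c in nx.connected_components(g))

print(graph_diameter([(0, 1), (1, 2), (3, 4)]))   # two components -> diameter 2
```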

Output Length.

One interesting phenomenon to analyze is the length distribution (in number of words) of the generated outputs. We expect our models to generate texts with output lengths similar to the reference texts. However, as shown in Figure 4c, the reference texts are usually longer than the texts generated by all models. The texts generated by CGE-no-pl, a CGE model without the length penalty, are consistently longer than those of the baseline. Also note that the generated texts become longer when we employ the length penalty (see Section 3.6). However, there is still a gap between the reference and the generated text lengths. We leave further investigation of this aspect for future work.

Effect of the Number of Nodes on the Output Length.

Figure 5 shows the effect of the size of a graph, defined as the number of nodes, on the quality (measured by CHRF++ scores) and length of the generated text (in number of words) on the AGENDA dev set. We bin both the graph size and the output length into 4 classes. Our model consistently outperforms the baseline, in some cases by a large margin. When handling smaller graphs, both models have difficulty generating good summaries. However, for these smaller graphs, our model achieves a score 12.2% better for one of the output-length ranges. Interestingly, when generating longer summaries (length > 140) from smaller graphs, our model outperforms the baseline by an impressive 21.7%, indicating that our model is more effective in capturing semantic signals from graphs with scarce information in order to generate better text. Our approach also performs better when the graph size is large but the generated output is short, beating the baseline by 9 points.

[Table 6 layout: Fluency (F) and Adequacy (A) scores for Adapt, CGE-LG and Reference, broken down by number of input triples (All, 1-2, 3-4, 5-7) and by graph diameter (1-2 and larger).]
Table 6: Fluency (F) and Adequacy (A) obtained in the human evaluation. #T refers to the number of input triples and #D to graph diameters. The ranking was determined by pair-wise Mann-Whitney tests with p < 0.05, and the difference between systems which have a letter in common is not statistically significant.

5.6 Human Evaluation

To further assess the quality of the generated texts, we conduct a human evaluation on the WebNLG test set with seen categories. Following previous work gardent-etal-2017-webnlg; castro-ferreira-etal-2019-neural, we assess two quality criteria: (i) Fluency (i.e., does the text flow in a natural, easy-to-read manner?) and (ii) Adequacy (i.e., does the text clearly express the data?). We divide the datapoints into seven sets by the number of triples. For each set, we randomly select 20 texts generated by Adapt, by CGE-LG, and their corresponding human reference texts (420 texts in total). Since the number of datapoints per set is not balanced (see Table 5), this sampling strategy ensures the same number of samples for the different triple sets. Moreover, the human references serve as a sanity check for the human evaluation experiment. We recruited workers from Mechanical Turk to rate the outputs on a 1-5 Likert scale. For each text, we collect scores from 4 workers and average them. Table 6 shows the results. We first note a trend similar to the automatic evaluation, with CGE-LG outperforming Adapt on both fluency and adequacy. In the sets with fewer than 5 triples, CGE-LG was the highest-rated system in fluency. As in the automatic evaluation, both systems are better at generating text from graphs with smaller diameters. Larger diameters pose difficulties for the models, which achieve their worst performance for the largest diameters.
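The pairwise significance test mentioned in the Table 6 caption can be run with SciPy as sketched below; the Likert ratings here are made-up illustrative values, not the study's data.

```python
from scipy.stats import mannwhitneyu

# Illustrative 1-5 Likert ratings for two systems on the same set of items.
cge_lg = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]
adapt  = [4, 4, 3, 4, 3, 4, 3, 4, 4, 3]

stat, p = mannwhitneyu(cge_lg, adapt, alternative='two-sided')
print(f"U={stat:.1f}, p={p:.3f}")   # difference considered significant if p < 0.05
```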

6 Conclusion

We introduced a unified graph attention network structure for investigating graph-to-text architectures that combine global and local graph representations in order to improve text generation. An extensive evaluation of our models demonstrated that the global and local contexts are empirically complementary, and that a combination of both can achieve state-of-the-art results on KG-to-text generation. In addition, cascaded architectures give better results than parallel ones. To our knowledge, we are the first to consider both local and global aggregation in a graph attention network.

Acknowledgments

This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1.

References