Investigating Pretrained Language Models for Graph-to-Text Generation

07/16/2020 · by Leonardo F. R. Ribeiro, et al.

Graph-to-text generation, a subtask of data-to-text generation, aims to generate fluent texts from graph-based data. Many graph-to-text models have shown strong performance in this task employing specialized graph encoders. However, recent approaches employ large pretrained language models (PLMs) achieving state-of-the-art results in data-to-text generation. In this paper, we aim to investigate the impact of large PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. Our analysis shows that PLMs such as BART and T5 achieve state-of-the-art results in graph-to-text benchmarks without explicitly encoding the graph structure. We also demonstrate that task-adaptive pretraining strategies are beneficial to the target task, improving even further the state of the art in two benchmarks for graph-to-text generation. In a final analysis, we investigate possible reasons for the PLMs' success on graph-to-text tasks. We find evidence that their knowledge about the world gives them a big advantage, especially when generating texts from KGs.


1 Introduction

Graphs are important structures in natural language processing (NLP) as they represent complex relations between a set of objects. For example, syntactic and semantic structures of sentences can be represented using different graph representations bastings-etal-2017-graph; banarescu-etal-2013-abstract, and Knowledge Graphs (KGs) are used to encode factual knowledge in the form of relations between entities gardent-etal-2017-webnlg.

Graph-to-text generation, a subtask of data-to-text generation 10.5555/3241691.3241693, aims to create meaningful and coherent natural language text describing an input graph. Recent efforts in graph-to-text generation song-etal-acl2018; damonte-cohen-2019-structural; ribeiro-etal-2019-enhancing; ribeiro2020modeling focus on effectively encoding the input graph, employing graph encoders usually built upon Graph Neural Networks (GNNs) Kipf:2016tc; NIPS2017_6703 or Transformers NIPS2017_7181. These models represent input graphs better than standard text-to-text models, in which input graphs are linearized and structural graph information is neglected konsas_17; colin-gardent-2018-generating. In graph-to-text models, the encoder computes representations for a graph-based input that should be invariant to the node order, whereas the decoder generates text as a linear chain structure. The complex nature of graphs increases the difficulty of defining alignments between source nodes/edges and target tokens, broadening the structural gap between the encoder and decoder zhao-etal-2020-bridging.

Figure 1: Examples of (a) AMR and (b) WebNLG graphs, the reference texts and the input for the graph-to-text models.

Transfer learning has become ubiquitous in NLP, and pretrained Transformer-based architectures have considerably outperformed the prior state of the art devlin-etal-2019-bert; liu2020roberta; radford2019language. Following this trend, very recent work mager2020gpttoo; harkous2020text; kale2020texttotext; radev2020dart applies transfer learning to data-to-text generation, where a language model is first pretrained on massive corpora and then fine-tuned on the target task. More interestingly, some of these approaches are successfully employed for graph-to-text generation, even though they do not explicitly encode the graph structure.

In this paper, we investigate pretrained language model (PLM) approaches for graph-to-text generation. We present a study across three graph domains (meaning representations, Wikipedia KGs and scientific KGs) and show strong results even though the models do not encode any graph-specific structural bias. In particular, we examine two recent text-to-text models: BART lewis2019bart and T5 2019t5. These architectures are well suited for conditional text generation as they both follow a Transformer encoder-decoder architecture. Their main differences lie in their pretraining strategies and in the datasets on which they were pretrained.

Our contributions are threefold:

  • We conduct extensive experiments with BART and T5 and show that PLMs consistently outperform recent specialized graph-to-text models on three benchmarks for graph-to-text generation.

  • We collect additional task-specific data and propose supervised and unsupervised task-adaptive pretraining strategies, which further improve the performance on two graph-to-text benchmarks. Additionally, we will release these datasets.

  • We show that, even though encoding the structural graph bias improves results for previous models trained from scratch, PLMs perform well even when trained on a shuffled version of the graph input. In further analysis, we show that the knowledge acquired during pretraining gives PLMs such a significant advantage on certain graph-to-text benchmarks that they do not need to understand the input graph structure in order to perform well.

2 Related Work

Graph-to-text generation can be divided into two main tasks: (i) MR-to-text generation, i.e., generating text from meaning representations konsas_17, and (ii) KG-to-text generation, i.e., generating text from knowledge graphs (KGs) gardent-etal-2017-webnlg.

AMR-to-Text Generation.

Abstract Meaning Representation (AMR; banarescu-etal-2013-abstract) is a semantic formalism that represents the meaning of a sentence as a rooted directed graph expressing “who is doing what to whom”. In an AMR graph, nodes represent concepts and edges represent semantic relations. Various neural models have been proposed to generate sentences from AMR graphs. konsas_17 propose the first neural approach for AMR-to-text generation that uses a linearized input graph. song-etal-acl2018 and beck-etal-2018-acl2018 propose GNNs based on recurrent mechanisms to directly encode the AMR graph structure. damonte-cohen-2019-structural investigate combinations of graph convolutional networks (GCN) and LSTMs in order to encode the AMR nodes. ribeiro-etal-2019-enhancing develop a graph encoder that computes a top-down and a bottom-up representation of the AMR graph whereas dcgcnforgraph2seq19guo employ dense connections between GNN layers. Recent methods zhu-etal-2019-modeling; cai-lam-2020-graph; doi:10.116200297; song-etal-2020-structural employ Transformers to learn node representations injecting the graph structure into the self-attention aggregation.

KG-to-Text Generation.

Recent neural approaches for KG-to-text generation linearize the KG triples ignoring the graph structure moryossef-etal-2019-step; castro-ferreira-etal-2019-neural. trisedya-etal-2018-gtr develop an LSTM encoder that encodes relationships within and between triples. marcheggiani-icnl18 propose to employ GNNs to capture node contexts, and demonstrate superior performance compared to LSTMs. koncel-kedziorski-etal-2019-text propose a Transformer-based approach which encodes the input graph by computing node representations based on the node context of direct neighbors. Recently, ribeiro2020modeling propose a unified Graph Attention Network (GAT) framework that encodes both global and node contexts in order to better capture the graph topology.

Pretrained Models.

Pretraining Transformer-based methods, such as BERT devlin-etal-2019-bert, GPT-2 radford2019language, XLNet NIPS2019_8812, or RoBERTa liu2020roberta, have established a qualitatively new level of baseline performance for many widely used Natural Language Understanding (NLU) benchmarks, including the popular GLUE wang-etal-2018-glue. mager2020gpttoo is the first approach that employs a pretrained Transformer-based language model (GPT-2) for AMR-to-text generation. Very recently, harkous2020text and kale2020texttotext demonstrate state-of-the-art results in different data-to-text benchmarks, employing GPT-2 and T5 models respectively. Concurrent to our work, radev2020dart propose DART, a large dataset for data-to-text generation, and employ BART in the WebNLG dataset, augmenting the training data with DART and achieving good performance in the out-of-domain setting.

3 Fine-tuning Pretrained Models

In this paper, we investigate two PLMs based on the Transformer encoder-decoder architecture NIPS2017_7181: BART and T5. The main differences between these two models are how they are trained and the input corpora for pretraining. Our main motivation for using BART and T5 is that their encoder-decoder architectures are particularly flexible and thus likely well-suited for conditioned text generation tasks.

BART.

BART lewis2019bart is pretrained as a text-to-text denoising autoencoder. Its architecture follows the standard Transformer encoder-decoder, except that ReLU activation functions are replaced by GeLUs. The pretraining stage has two phases: first, the input text is corrupted with an arbitrary noising function; second, a text-to-text model is learned in order to reconstruct the original text. The model is pretrained on a combination of books and Wikipedia data. In our experiments, we evaluate the impact of model capacity with two versions: base with 140M parameters and large with 400M parameters.

T5.

The T5 2019t5 model aims to convert different problems into a text-to-text format. T5 follows the original Transformer architecture with a modification to the positional embeddings: instead of using a fixed embedding for each position, it employs a simplified form of relative position embeddings, where each embedding is simply a scalar added to the corresponding logit used for computing the attention weights. T5 is pretrained on a cleaned-up version of Common Crawl's web crawl corpus and demonstrates state-of-the-art performance on datasets such as CNN/Daily Mail chen-etal-2016-thorough and SuperGLUE NIPS2019_8589. We experiment with T5 models of different capacities: small with 60M, base with 220M, and large with 770M parameters.
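To make the setup concrete, the following is a minimal sketch (not taken from the paper) of how these model variants can be instantiated with the transformers library released by wolf2019huggingfaces; the checkpoint identifiers are the public hub names and are our assumption, as the paper does not list them explicitly.

```python
# Minimal sketch: loading the model variants via Hugging Face transformers.
# The checkpoint identifiers below are the public hub names (an assumption,
# not spelled out in the paper).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINTS = {
    "BART-Base": "facebook/bart-base",    # ~140M parameters
    "BART-Large": "facebook/bart-large",  # ~400M parameters
    "T5-Small": "t5-small",               # ~60M parameters
    "T5-Base": "t5-base",                 # ~220M parameters
    "T5-Large": "t5-large",               # ~770M parameters
}

def load(name: str):
    """Return (tokenizer, model) for one of the variants listed above."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model
```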

     Model BLEU METEOR chrF++ BS (F1) MS BT
     damonte-cohen-2019-structural 23.30 - 50.40 - - -
     ribeiro-etal-2019-enhancing 27.87 33.21 - - - -
     zhu-etal-2019-modeling 31.82 36.38 64.05 - - -
     dcgcnforgraph2seq19guo 27.60 - 57.30 - - -
     cai-lam-2020-graph 29.80 35.10 59.40 - - -
     zhao-etal-2020-line 32.46 36.78 - - - -
     yao-etal-2020-heterogeneous 34.10 38.10 65.60 - - -
     based on pretrained language models
     mager2020gpttoo 33.02 37.68 63.89 - - -
     harkous2020text 37.70 38.90 - - - -
     BART-Base 36.43 36.95 64.24 94.15 54.55 45.64
     BART-Large 43.72 41.27 71.27 95.59 62.92 55.01
     T5-Small 39.15 38.98 66.39 94.64 58.64 48.37
     T5-Base 43.37 40.43 69.22 95.12 60.87 50.75
     T5-Large 43.18 40.10 66.79 95.20 61.01 52.28
Table 1: Results on AMR-to-text generation for the AMR17 test set. BS, MS and BT stand for BERTScore, MoverScore and BLEURT, respectively.

Fine-tuning.

In order to explore transfer learning with large PLMs for graph-to-text generation, we train BART and T5 for several epochs on a supervised downstream dataset, a process referred to as fine-tuning. We also experiment with an intermediate pretraining step before the fine-tuning phase. In this intermediate step, we train the models on additional task-specific data, employing supervised and unsupervised approaches (see Section 6). For T5, in the supervised setup, we add the prefix “translate from Graph to Text:” before the graph input. We add this prefix to imitate the setup used during T5's original pretraining, where prefixes mark translation between different languages. The motivation for adding this kind of prefix is that it could help the model identify the new task as a form of translation and better distinguish between graphs and natural language text.
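For illustration, a minimal fine-tuning sketch is given below, assuming the Hugging Face transformers and PyTorch APIs. The prefix string follows the description above; the checkpoint, learning rate, truncation settings and single-example updates are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal fine-tuning sketch (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # lr is an assumption

PREFIX = "translate from Graph to Text: "  # task prefix used for T5

def training_step(linearized_graph: str, target_text: str) -> float:
    """One gradient step on a single (linearized graph, target text) pair."""
    inputs = tokenizer(PREFIX + linearized_graph, return_tensors="pt", truncation=True)
    labels = tokenizer(target_text, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```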

4 Datasets

In this work, we evaluate text-to-text pretrained models on three graph-to-text benchmarks: AMR17 (LDC2017T10, downloaded from https://amr.isi.edu/download.html), WebNLG gardent-etal-2017-webnlg, and AGENDA koncel-kedziorski-etal-2019-text. Table 2 shows statistics for each dataset.

#train #dev #test
AGENDA 38,720 1,000 1,000
WebNLG 18,102 872 971
AMR17 36,521 1,368 1,371
Table 2: Data statistics.

AMR17.

The LDC2017T10 corpus contains instances pairing a sentence with its corresponding AMR graph. The AMR graphs are processed and represented in PENMAN notation (see Figure 1), as it was shown to perform better than other preprocessing techniques mager2020gpttoo. We use the preprocessing script by ribeiro-etal-2019-enhancing to linearize the graph.
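As an illustration of what linearization means here, the snippet below gives a simplified sketch that strips variable names from PENMAN notation and keeps concepts, relations and parentheses as plain tokens. It only approximates common AMR preprocessing; it is not the exact script of ribeiro-etal-2019-enhancing, and the example graph is invented for illustration.

```python
import re

def linearize_penman(penman_str: str) -> str:
    """Simplified sketch: drop "variable /" prefixes and treat parentheses,
    concepts and relation labels as plain tokens. This approximates common
    AMR preprocessing; it is NOT the exact script used in the paper."""
    no_vars = re.sub(r"\b\w+\s*/\s*", "", penman_str)        # remove "x / "
    spaced = no_vars.replace("(", " ( ").replace(")", " ) ")  # parentheses as tokens
    return " ".join(spaced.split())

# Illustrative example:
# linearize_penman("(l / like-01 :ARG0 (d / dog) :ARG1 (b / bone))")
# -> "( like-01 :ARG0 ( dog ) :ARG1 ( bone ) )"
```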

WebNLG.

Each instance of WebNLG contains a KG extracted from DBPedia (which is constructed by extracting structured content from Wikipedia) 10.5555/1785162.1785216 in the form of triples, and a target text, consisting of one or multiple sentences, describing the graph. The test set is divided into three partitions: seen, which contains only DBPedia categories present in the training set; unseen, which covers categories never seen during training; and all, which includes categories from both the seen and unseen sets. Similarly to previous work harkous2020text; kale2020texttotext, we linearize the triples by adding special tokens: we prepend one token each before the head entity, the relation and the tail entity of a triple, respectively.
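A sketch of this linearization is shown below. We write the special tokens as <H>, <R> and <T>, following related work; the exact token strings and the example triple are assumptions for illustration.

```python
def linearize_triples(triples):
    """Sketch: mark the head entity, relation and tail entity of each triple
    with special tokens (here assumed to be <H>, <R>, <T>) and concatenate."""
    parts = []
    for head, relation, tail in triples:
        parts += ["<H>", head, "<R>", relation, "<T>", tail]
    return " ".join(parts)

# e.g. linearize_triples([("Alan Bean", "birthPlace", "Wheeler, Texas")])
# -> "<H> Alan Bean <R> birthPlace <T> Wheeler, Texas"
```

In practice, such special tokens would also be registered with the subword tokenizer (e.g., via tokenizer.add_tokens) so that they are not split into pieces.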

AGENDA.

This dataset is constructed from the scientific AI domain. In particular, KGs are paired with scientific abstracts extracted from the proceedings of 12 top AI conferences. Each sample contains the paper title, a KG and the corresponding paper abstract. A KG contains entities corresponding to scientific terms, which are often multi-word expressions, and its edges represent relations between the entities. The dataset has loose alignments between the input graph and the corresponding text, as the graphs were automatically generated by an information extraction system koncel-kedziorski-etal-2019-text. The input for the models is a text containing the title, a sequence of all KG entities, and the linearized triples. The target text is the paper abstract. We add special tokens to the triples in the same way as for WebNLG. In contrast to WebNLG, we additionally list all entities given in the input, besides the linearized triples, because AGENDA contains many instances with isolated entities, i.e., entities that do not occur in any triple; in WebNLG, this is never the case.
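A sketch of how such an input string can be assembled is given below; the separators and the example values are illustrative assumptions.

```python
def build_agenda_input(title: str, entities: list, linearized_triples: str) -> str:
    """Sketch of the AGENDA model input: the paper title, then the full list
    of KG entities (many are isolated and appear in no triple), then the
    triples linearized as for WebNLG. Separators here are illustrative."""
    return " ".join([title, " ".join(entities), linearized_triples])

# e.g. build_agenda_input(
#     "A Graph-to-Text Model",
#     ["graph encoder", "text generation"],
#     "<H> graph encoder <R> used-for <T> text generation",
# )
```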

BLEU METEOR chrF++
Model A S U A S U A S U
 Melbourne 45.13 54.52 33.27 37.00 41.00 33.00 - - -
trisedya-etal-2018-gtr 37.10 54.00 29.20 31.00 37.00 28.00 - - -
castro-ferreira-etal-2019-neural 51.68 56.35 38.92 32.00 41.00 21.00 - - -
moryossef-etal-2019-step 47.24 53.30 34.41 39.00 44.00 37.00 - - -
schmitt2020modeling - 59.39 - - 42.83 - - 74.68 -
ribeiro2020modeling - 63.69 - - 44.47 - - 76.66 -
zhao-etal-2020-bridging 52.78 64.42 38.23 41.00 46.00 37.00 - - -
based on pretrained language models
harkous2020text 52.90 - - 42.40 - - - - -
kale2020texttotext 57.10 63.90 52.80 44.00 46.00 41.00 - - -
 BART-Base 49.81 58.71 38.47 39.20 42.94 35.05 67.65 73.30 61.50
 BART-Large 49.49 62.46 35.87 41.47 45.68 36.92 71.08 77.92 63.61
 T5-Small 55.02 62.53 45.83 41.86 44.78 38.57 71.54 76.20 66.46
 T5-Base 57.46 62.93 50.88 43.05 44.99 40.81 73.31 76.57 69.75
 T5-Large 59.29 63.06 54.69 43.99 45.35 42.41 74.54 76.89 71.98
Table 3: Results on the WebNLG test set. A, S and U stand for the all, seen and unseen partitions of the WebNLG test set, respectively.

5 Experiments

We develop our experiments based on the BART and T5 pretrained models released by wolf2019huggingfaces. We use Adam kingma:adam as the optimizer with an initial learning rate of . We do not employ warm-up in our learning rate schedule. For decoding, we use beam search with beam sizes of 5, 3, and 5 for AMR17, WebNLG, and AGENDA, respectively. Following previous work, we evaluate the results using the BLEU Papineni:2002:BMA:1073083.1073135, METEOR Denkowski14meteoruniversal and chrF++ popovic-2015-chrf automatic metrics. (For BLEU, we use the multi-BLEU script from the MOSES decoder suite Koehn:2007:MOS:1557769.1557821; for METEOR, we use the original meteor-1.5.jar script, https://github.com/cmu-mtlab/meteor.) We also evaluate the results using the MoverScore zhao-etal-2019-moverscore, BERTScore bert-score and BLEURT sellam-etal-2020-bleurt metrics, as they employ contextual embeddings to incorporate semantic knowledge. We report test results for the checkpoint with the best BLEU score on the development set.
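For illustration, the decoding step can be sketched as follows, taking a fine-tuned model and tokenizer as arguments (e.g., those from the fine-tuning sketch in Section 3); the maximum generation length is an illustrative assumption.

```python
def generate_text(model, tokenizer, linearized_graph: str,
                  prefix: str = "translate from Graph to Text: ",
                  num_beams: int = 5) -> str:
    """Beam-search decoding sketch (beam size 5 as used for AMR17 and AGENDA).
    The maximum length below is an illustrative assumption."""
    inputs = tokenizer(prefix + linearized_graph, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs, num_beams=num_beams, max_length=384, early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```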

5.1 Results on AMR-to-Text

Table 1 shows the results of our experiments alongside several recent results reported on the AMR17 test set. Note that we also report results from models that leverage PLMs: mager2020gpttoo and harkous2020text employ the GPT-2 model in their approaches. However, whereas GPT-2 is a Transformer-based decoder, our approaches employ BART and T5, which follow a Transformer-based encoder-decoder architecture. T5-Large achieves a considerable improvement of 5.48 BLEU and 1.2 METEOR points over the previous state-of-the-art model, which uses GPT-2. BART-Large achieves a new state-of-the-art BLEU score of 43.72, 15.9% higher than the previous best. Note that the semantic metrics follow similar trends.

5.2 Results on WebNLG

We compare the performance of our models with several state-of-the-art results reported on this dataset. Table 3 shows the results for the WebNLG test set. Melbourne is one of the best competitors in the WebNLG challenge. moryossef-etal-2019-step and castro-ferreira-etal-2019-neural develop neural pipeline approaches with explicit intermediate steps in the generation process and achieve strong performance on the unseen set. On the other hand, fully end-to-end models ribeiro2020modeling; schmitt2020modeling have strong performance on the seen set but usually perform poorly on unseen data. A very recent model zhao-etal-2020-bridging achieves the best performance on the seen test set; however, this model leverages additional information about the order in which the triples are realized in the target text. T5-Large achieves 59.29 BLEU points on the complete (all) test set and 54.69 on the unseen partition. Note that kale2020texttotext also employ T5 on this benchmark. In particular, in our T5 setup, we add a prefix before the input graph (see Section 3). BART performs considerably worse than T5 on this dataset, scoring around 10 BLEU points lower than T5 on the all partition.

     Model BLEU METEOR chrF++ BS (F1) MS BT
     koncel-kedziorski-etal-2019-text 14.30 18.80 - - - -
     An2019RepulsiveBS 15.10 19.50 - - - -
     schmitt2020modeling 17.33 21.43 44.53 - - -
     ribeiro2020modeling 18.01 22.23 46.37 - - -
     BART-Base 22.01 23.54 48.02 89.36 34.33 -13.02
     BART-Large 23.65 25.19 50.44 88.74 32.24 -10.93
     T5-Small 20.22 21.62 44.91 88.56 30.25 -24.10
     T5-Base 20.73 21.88 48.14 88.81 31.33 -21.03
     T5-Large 22.15 23.73 48.14 89.60 35.23 -13.96
Table 4: Results on the AGENDA test set. BS, MS and BT stand for BERTScore, MoverScore and BLEURT, respectively.

5.3 Results on AGENDA

In Table 4 we show the results for the AGENDA test set. We compare our approaches against several recent results reported on this dataset. It is worth mentioning that AGENDA graphs only partially cover the content of the target texts. In contrast to WebNLG, this dataset comes from the scientific domain: the KGs and texts are extracted from AI papers koncel-kedziorski-etal-2019-text. The models also show strong performance on this dataset. We believe that their capacity to generate fluent text helps when generating paper abstracts, even though they were not pretrained on the scientific domain. We note that an intermediate task-adaptive pretraining step can enable the models to achieve even better performance (see Section 6). Finally, note that BART shows impressive performance, with its large variant achieving a score of 23.65 BLEU points, 5.6 points higher than the previous state of the art.

Figure 2: Task-adaptive pretraining with additional data. We experiment employing supervised and unsupervised approaches.

6 Task-adaptive Pretraining

Following recent work gururangan-etal-2020-dont, we investigate whether leveraging additional task-specific data can further improve performance on graph-to-text generation. Task-specific data refers to a smaller pretraining corpus that is more relevant to the target task. In order to leverage the task-specific data, we add an intermediate pretraining step between the original pretraining and the fine-tuning phase for graph-to-text generation. Figure 2 shows the proposed training pipeline.

More precisely, we first continue pretraining BART and T5 using unsupervised or supervised training. In the supervised approach, we use pairs of graphs and corresponding texts collected from the same or a similar domain as the target task. In the unsupervised approach, we follow the BART and T5 pretraining strategies, corrupting the target text. Note that we do not use the graphs in the unsupervised pretraining, but only the target texts of our task-specific data collections. In particular, we randomly mask text spans, replacing 15% of the tokens (please refer to lewis2019bart and 2019t5 for details about the unsupervised pretraining strategies). Finally, we fine-tune the models using the original training set.
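To make the unsupervised objective concrete, below is a toy sketch of span masking over the target texts; the span length distribution, sentinel format and other details are illustrative assumptions and differ from the actual BART and T5 noising implementations.

```python
import random

def mask_spans(tokens, mask_rate=0.15, mean_span_len=3, seed=0):
    """Toy sketch: replace roughly mask_rate of the tokens with sentinel
    tokens, in contiguous spans (T5-style). Real BART/T5 noising is more
    involved; this only illustrates the corrupted-input side."""
    if not tokens:
        return []
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * mask_rate))
    masked = list(tokens)
    covered = 0
    while covered < n_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(len(masked))
        for i in range(start, min(len(masked), start + span_len)):
            if masked[i] is not None:
                masked[i] = None
                covered += 1
    out, sentinel_id = [], 0
    for tok in masked:
        if tok is None:
            # collapse each masked span into a single sentinel token
            if not out or not out[-1].startswith("<extra_id_"):
                out.append(f"<extra_id_{sentinel_id}>")
                sentinel_id += 1
        else:
            out.append(tok)
    return out

# mask_spans("the model is pretrained on scientific abstracts".split())
# might return: ['the', 'model', 'is', '<extra_id_0>', 'scientific', 'abstracts']
```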

We employ this strategy for two graph domains, meaning representations (AMR17) and scientific data (AGENDA), for which we collected additional task-specific data.

AMR Silver Data.

In order to generate additional data for AMR, we sample 200K Gigaword (https://catalog.ldc.upenn.edu/LDC2003T05) sentences and use JAMR (https://github.com/jflanigan/jamr) flanigan-etal-2016-cmu to parse them into AMR graphs. Note that the newswire data in Gigaword is also one of the data sources of the AMR17 benchmark. We use the 200K silver graphs to pretrain the models in a supervised way before fine-tuning them on gold AMR graphs. For the unsupervised pretraining, we only use the sentences.

Title Abstract KG
Vocab 48K 173K 113K
Tokens 2.1M 31.7M 9.6M
Entities - - 3.7M
Avg Length 11.1 167.1 -
Avg #Nodes - - 19.9
Avg #Edges - - 9.4
Table 5: Statistics for KGAIA. Averages are computed per instance.

Semantic Scholar AI Data.

We collect titles and abstracts of around 190K scientific papers from the Semantic Scholar corpus ammar-etal-2018-construction, taken from the proceedings of 36 top Computer Science/AI conferences. We construct KGs from the paper abstracts employing DyGIE++ wadden-etal-2019-entity, an automatic information extraction system. Note that the AGENDA dataset was constructed using the SciIE system luan-etal-2018-multi, which also extracts KGs from AI scientific papers. However, our new dataset covers a broader domain, as we collected data from 36 conferences compared to 12 for AGENDA. Furthermore, to prevent data leakage, all AGENDA samples used for performance evaluation are removed from the dataset. We call this dataset KGAIA (KGs from AI Abstracts). Table 5 shows the dataset statistics.

AMR17
Model BLEU METEOR chrF++
BART 43.72 41.27 71.27
BART-TSU 43.94 41.15 71.14
BART-TSS 45.54 42.50 72.78
T5 43.18 40.10 66.79
T5-TSU 45.91 41.82 71.19
T5-TSS 48.85 43.12 73.10
AGENDA
Model BLEU METEOR chrF++
BART 23.65 25.19 50.44
BART-TSU 25.30 25.54 51.33
BART-TSS 25.66 25.74 51.63
T5 22.15 23.73 48.14
T5-TSU 22.92 24.40 49.37
T5-TSS 23.69 24.92 50.27
Table 6: Impact of additional task-adaptive pretraining of BART-Large and T5-Large on AMR17 and AGENDA datasets. TSU and TSS refer to unsupervised and supervised pretraining on the additional task-specific data.

6.1 Results

Table 6 compares the models' performance with no additional pretraining, with additional unsupervised pretraining solely on the target texts of the respective in-domain corpus (TSU), and with additional task-adaptive supervised pretraining (TSS), before fine-tuning on the respective training sets. Unsupervised pretraining helps less than supervised pretraining but also brings gains in most cases. This suggests that the performance increases after additional supervised pretraining do not only come from seeing more task-specific target texts, but that the models also learn how to handle graphs and the graph-to-text correspondence. Also note that we use the same number of datapoints for the unsupervised and supervised experiments. It is likely that unsupervised pretraining on larger amounts of in-domain data would improve the models even more.

AMR17.

When fine-tuned directly after the original pretraining, BART performs better than T5. However, after pretraining on the task-specific data, T5 outperforms BART by 3.3 BLEU points, achieving 48.85 BLEU points, the new state of the art for AMR-to-text generation.

AGENDA.

BART benefits more from the task-adaptive pretraining, achieving a new state of the art of 25.66 BLEU points, a gain of 2 BLEU points over its fine-tuned-only counterpart. The improvements from task-adaptive pretraining are not as large as on the AMR dataset. We believe that this is because the input graphs do not completely cover the target text, making this dataset more challenging.

Pretrained on Evaluated on BLEU
None AGENDA 22.01
KGAIA AGENDA 23.48
WebNLG AGENDA 21.98
None WebNLG-Seen 58.71
KGAIA WebNLG-Seen 63.20
AGENDA WebNLG-Seen 61.25
Table 7: Impact of cross-domain supervised pretraining for BART-Base on KG-to-text generation.

6.2 Cross-domain Pretraining

T/F | Input Fact | order | shuf
(1) F | Ohio is Part Of Cleveland | Ohio is part of Cleveland. | Cleveland is part of Ohio.
(2) F | United States is Part Of Amarillo, Texas | Amarillo, Texas is part of the United States. | Amarillo, Texas is part of the United States.
(3) F | Leinster is Part Of Dublin | Leinster is part of Dublin. | Leinster is part of Dublin.
(4) F | Rome capital Italy | Rome’s capital is Italy. | Rome is the capital of Italy.
(5) T | Italy capital Rome | Italy’s capital is Rome. | Rome is the capital of Italy.
(6) F | rome capital italy | The capital of rome is italy. | Italy is the capital of rome.
(7) T | italy capital rome | Italy’s capital is rome. | Italy’s capital is rome.
Table 8: Example generations from corrupted (F) and true (T) WebNLG dev set facts by T5-Small models fine-tuned on correctly ordered nodes (order) and randomly shuffled nodes (shuf) from the WebNLG training set.

We also investigate how cross-domain pretraining affects performance after fine-tuning. Table 7 shows the results obtained with BART-Base. While the target texts in KGAIA and AGENDA share the domain of scientific abstracts, texts in WebNLG are more general. The graphs in all three datasets are KGs, but WebNLG graphs do not share any relations with the other KGs. Nevertheless, supervised pretraining increases performance in the cross-domain setting in most cases. Only pretraining on the smaller WebNLG does not help for the larger AGENDA.

In general, the experiments suggest that exploring additional pretraining for graph-to-text generation tasks can improve the models’ performance even if the data do not come from the same domain. Therefore, we hope that our newly collected datasets will be valuable for future work on graph-to-text generation.

7 Graph Structure and Pretrained Models

7.1 Bag of Entities and Relations

To further examine to what extent the graph structure is used by the pretrained models, we conduct additional experiments in which the graph structure is encoded only by the order of the node labels, i.e., we remove the parentheses in AMRs and replace the triple tags by neutral separators for KGs. In this way, the graph structure is completely obscured if we shuffle the node labels. In particular, we investigate two model variants: a T5 model fine-tuned on ordered nodes, and another fine-tuned on shuffled nodes. In Table 9, we compare the effect on the performance of the T5 models when the graph structure is lost during training or evaluation.
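A sketch of how such bag-of-entities-and-relations inputs can be constructed is given below; the separator symbol, the assumed <H>/<R>/<T> tag strings, and the treatment of AMR tokens are assumptions for illustration.

```python
import random

def to_shuffled_bag(linearized: str, seed: int = 0) -> str:
    """Sketch of the 'shuf' probe input: strip structural markup and shuffle
    the remaining units. For KGs, units are the labels delimited by the
    (assumed) <H>/<R>/<T> tags; for AMR, parentheses are dropped and
    whitespace tokens are shuffled. Details are illustrative assumptions."""
    if any(tag in linearized for tag in ("<H>", "<R>", "<T>")):
        for tag in ("<H>", "<R>", "<T>"):
            linearized = linearized.replace(tag, " | ")
        units = [u.strip() for u in linearized.split("|") if u.strip()]
        separator = " | "  # neutral separator replacing the triple tags
    else:
        units = linearized.replace("(", " ").replace(")", " ").split()
        separator = " "
    random.Random(seed).shuffle(units)
    return separator.join(units)

# to_shuffled_bag("<H> Italy <R> capital <T> Rome")  ->  e.g. "Rome | Italy | capital"
```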

AMR17
Train Eval BLEU METEOR chrF++
order order 36.83 38.17 65.57
shuf order 17.02 27.79 51.00
order shuf 05.18 19.63 39.97
shuf shuf 15.56 27.23 50.20
WebNLG-Seen
Train Eval BLEU METEOR chrF++
order order 63.41 44.98 76.56
shuf order 61.54 44.19 75.55
order shuf 54.08 40.63 70.77
shuf shuf 61.07 44.26 75.58
AGENDA
Train Eval BLEU METEOR chrF++
order order 19.86 21.73 44.90
shuf order 19.03 21.40 44.31
order shuf 19.04 21.53 44.60
shuf shuf 19.08 21.49 44.37
Table 9: Impact of using a bag of entities and relations (shuf) as input for training or evaluation on AMR17, WebNLG-Seen, and AGENDA datasets. T5-Small was used for this experiment.

We observe that the AMR-to-text performance drops significantly when the graph structure is missing. KG-to-text performance, however, does not decrease much, indicating that most of the PLMs’ success in this task stems from their language modeling rather than their graph encoding capabilities. It has recently been argued that large PLMs acquire a certain amount of world knowledge during pretraining (petroni-etal-2019-language). We hypothesize that this knowledge makes it easier to recover KG facts based on a set of entities and relations than reconstructing a corrupted AMR.

7.2 Qualitative Analysis

To further test our hypothesis that pretrained models make use of their world knowledge during KG-to-text generation, we take example facts from the WebNLG dev set, corrupt them, and feed them to the two T5 models introduced in the previous section. Table 8 shows the generations.

Note that the model trained on correctly ordered nodes inside the triples has learned to rely more on the input graph structure. The false fact in example (1) is reliably transferred to the text by order but not by shuf, which silently corrects it. But even order is not completely free from its world knowledge bias, as illustrated in example (2), where both models refuse to generate an incorrect fact. This indicates that world knowledge is a strong guide during text generation even for models that were fine-tuned with the graph structure clearly marked. Interestingly, neither model corrects the wrong input in (3). The fact that Leinster is a region of Ireland and not, e.g., a neighborhood of the city of Dublin is probably unknown to T5. It seems that T5 falls back to the order of words in the input in such cases. Examples (4)–(7) also illustrate this behavior nicely. While the well-known entities “Rome” and “Italy” lead to similar behavior as “Ohio” and “Cleveland”, i.e., order complies with generating a false statement and shuf rather follows its world knowledge, lowercasing the entity names changes that: in the case of the unknown entities “rome” and “italy”, both models fall back to the order of the input to produce their texts.

8 Conclusion

We have studied the problem of generating text from graphs, employing two PLMs: BART and T5. First, we show that these models achieve state-of-the-art results on different graph-to-text benchmarks, even though they do not explicitly encode the graph structure. Second, we demonstrate that task-adaptive pretraining further improves the performance of PLMs for graph-to-text generation. Third, we examine to what extent the graph structure is taken into account during text generation, and our experiments suggest that prior world knowledge is a strong guide for these models. Future work should explore ways to inject the graph structure more intimately into large PLMs, possibly both leveraging their strong language modeling capabilities and keeping the output more faithful to the input.

References