Graphs are important structures in natural language processing (NLP) as they represent complex relations between sets of objects. For example, the syntactic and semantic structure of a sentence can be represented with different graph representations bastings-etal-2017-graph; banarescu-etal-2013-abstract, and Knowledge Graphs (KGs) are used to encode factual knowledge in the form of relations between entities gardent-etal-2017-webnlg.
Graph-to-text generation, a subtask of data-to-text generation 10.5555/3241691.3241693, aims to create meaningful and coherent natural language text to describe an input graph. Recent efforts for graph-to-text generation song-etal-acl2018; damonte-cohen-2019-structural; ribeiro-etal-2019-enhancing; ribeiro2020modeling
focus on effectively encoding the input graph, employing graph encoders usually built upon Graph Neural Networks (GNNs) Kipf:2016tc; NIPS2017_6703 or Transformers NIPS2017_7181. These models represent input graphs better than standard text-to-text models, in which input graphs are linearized and structural graph information is neglected konsas_17; colin-gardent-2018-generating. In graph-to-text models, the encoder computes representations for a graph-based input that should be invariant to the node order, whereas the decoder generates text as a linear chain structure. The complex nature of graphs increases the difficulty of defining alignments between the source nodes/edges and the target tokens, broadening the structural gap between the encoder and decoder zhao-etal-2020-bridging.
Transfer learning has become ubiquitous in NLP, and pretrained Transformer-based architectures have considerably outperformed the prior state of the art devlin-etal-2019-bert; liu2020roberta; radford2019language. Following this trend, very recent work mager2020gpttoo; harkous2020text; kale2020texttotext; radev2020dart applies transfer learning to data-to-text generation, where a language model is first pretrained on massive corpora before being fine-tuned on the target task. Interestingly, some of these approaches are successfully employed for graph-to-text generation even though they do not explicitly encode the graph structure.
In this paper, we investigate pretrained language model (PLM) approaches for graph-to-text generation. We present a study across three graph domains (meaning representations, Wikipedia KGs, and scientific KGs) and show strong results even though the models do not encode any graph-specific structural bias. In particular, we examine two recent text-to-text models: BART lewis2019bart and T5 2019t5. Both architectures are suitable for conditional text generation as they follow a Transformer encoder-decoder architecture; their main differences lie in their pretraining strategies and the datasets on which they were trained.
Our contributions are threefold:
We execute extensive experiments with BART and T5 and show that PLMs consistently outperform recent specialized graph-to-text models on three benchmarks for graph-to-text generation.
We collect additional task-specific data and propose supervised and unsupervised task-adaptive pretraining strategies, which further improve performance on two graph-to-text benchmarks. Additionally, we will release these datasets.
We show that even though encoding a structural graph bias improves results in previous models trained from scratch, PLMs perform well even when trained on a shuffled version of the graph input. In further analysis, we show that the knowledge acquired during pretraining gives PLMs such a significant advantage on certain graph-to-text benchmarks that they do not need to understand the input graph structure to achieve good performance.
2 Related Work
Graph-to-text generation can be divided into two main tasks: (i) MR-to-text generation - generating text from meaning representations konsas_17 and (ii) KG-to-text generation - generating text from knowledge graphs (KG) gardent-etal-2017-webnlg.
Abstract Meaning Representation (AMR; banarescu-etal-2013-abstract) is a semantic formalism that represents the meaning of a sentence as a rooted directed graph expressing “who is doing what to whom”. In an AMR graph, nodes represent concepts and edges represent semantic relations. Various neural models have been proposed to generate sentences from AMR graphs. konsas_17 propose the first neural approach for AMR-to-text generation that uses a linearized input graph. song-etal-acl2018 and beck-etal-2018-acl2018 propose GNNs based on recurrent mechanisms to directly encode the AMR graph structure. damonte-cohen-2019-structural investigate combinations of graph convolutional networks (GCN) and LSTMs in order to encode the AMR nodes. ribeiro-etal-2019-enhancing develop a graph encoder that computes a top-down and a bottom-up representation of the AMR graph whereas dcgcnforgraph2seq19guo employ dense connections between GNN layers. Recent methods zhu-etal-2019-modeling; cai-lam-2020-graph; doi:10.116200297; song-etal-2020-structural employ Transformers to learn node representations injecting the graph structure into the self-attention aggregation.
Recent neural approaches for KG-to-text generation linearize the KG triples ignoring the graph structure moryossef-etal-2019-step; castro-ferreira-etal-2019-neural. trisedya-etal-2018-gtr develop an LSTM encoder that encodes relationships within and between triples. marcheggiani-icnl18 propose to employ GNNs to capture node contexts, and demonstrate superior performance compared to LSTMs. koncel-kedziorski-etal-2019-text propose a Transformer-based approach which encodes the input graph by computing node representations based on the node context of direct neighbors. Recently, ribeiro2020modeling propose a unified Graph Attention Network (GAT) framework that encodes both global and node contexts in order to better capture the graph topology.
Pretraining Transformer-based methods, such as BERT devlin-etal-2019-bert, GPT-2 radford2019language, XLNet NIPS2019_8812, or RoBERTa liu2020roberta, have established a qualitatively new level of baseline performance for many widely used Natural Language Understanding (NLU) benchmarks, including the popular GLUE wang-etal-2018-glue. mager2020gpttoo is the first approach that employs a pretrained Transformer-based language model (GPT-2) for AMR-to-text generation. Very recently, harkous2020text and kale2020texttotext demonstrate state-of-the-art results in different data-to-text benchmarks, employing GPT-2 and T5 models respectively. Concurrent to our work, radev2020dart propose DART, a large dataset for data-to-text generation, and employ BART in the WebNLG dataset, augmenting the training data with DART and achieving good performance in the out-of-domain setting.
3 Fine-tuning Pretrained Models
In this paper, we investigate two PLMs based on the Transformer encoder-decoder architecture NIPS2017_7181: BART and T5. The main differences between these two models are how they are trained and the input corpora for pretraining. Our main motivation for using BART and T5 is that their encoder-decoder architectures are particularly flexible and thus likely well-suited for conditioned text generation tasks.
BART lewis2019bart is pretrained as a text-to-text denoising autoencoder. Compared to the original Transformer, BART replaces the ReLU activation functions with GeLUs. The pretraining stage has two phases: first, the input text is corrupted with an arbitrary noising function; second, a text-to-text model is learned to reconstruct the original text. The model is pretrained on a combination of books and Wikipedia data. In our experiments, we evaluate the impact of model capacity with two versions: base with 140M parameters and large with 400M parameters.
The T5 2019t5 model aims to convert different problems into a text-to-text format. T5 follows the original Transformer architecture with a modification to the positional embeddings: instead of using a fixed embedding for each position, it employs a simplified form of relative position embeddings, where each embedding is a scalar that is added to the corresponding logit used for computing the attention weights. T5 is pretrained on a cleaned-up version of Common Crawl's web text and has demonstrated state-of-the-art performance on datasets such as CNN/Daily Mail chen-etal-2016-thorough and SuperGLUE NIPS2019_8589. We experiment with T5 models of different capacities: small with 60M, base with 220M, and large with 770M parameters.
In order to explore transfer learning with large PLMs for graph-to-text generation, we train BART and T5 for several epochs on a supervised downstream dataset, a process referred to as fine-tuning. We also experiment with an intermediate pretraining step before the fine-tuning phase, in which we train the models with additional task-specific data, employing supervised and unsupervised approaches (see Section 6). For T5, in the supervised setup, we add the prefix "translate from Graph to Text:" before the graph input. We add this prefix to imitate the setup used during T5's original pretraining when translating between different languages. The motivation is that the prefix could help the model identify the new task as a form of translation and better distinguish between graphs and natural-language text.
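Constructing the model input with the task prefix amounts to a simple string concatenation before tokenization; a minimal sketch (the helper name is ours, not from the paper):

```python
# Task prefix used for T5 in the supervised setup, mirroring the
# translation-style prompts from T5's original pretraining.
TASK_PREFIX = "translate from Graph to Text: "

def make_t5_input(linearized_graph: str) -> str:
    """Prepend the task prefix to a linearized graph string."""
    return TASK_PREFIX + linearized_graph

print(make_t5_input("<H> Italy <R> capital <T> Rome"))
# translate from Graph to Text: <H> Italy <R> capital <T> Rome
```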
In this work, we evaluate text-to-text pretrained models on three graph-to-text benchmarks: AMR17 (LDC2017T10, downloaded from https://amr.isi.edu/download.html), WebNLG gardent-etal-2017-webnlg, and AGENDA koncel-kedziorski-etal-2019-text. Table 2 shows statistics for each dataset.
The LDC2017T10 corpus contains instances consisting of a sentence annotated with its corresponding AMR graph. The AMR graphs are processed and represented in PENMAN notation (see Figure 1), as this was shown to perform better than other preprocessing techniques mager2020gpttoo. We use the preprocessing script by ribeiro-etal-2019-enhancing to linearize the graph.
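As a rough illustration of what such a linearization can look like, a common simplification is to strip the AMR variable names while keeping the bracketed structure. This is a simplified sketch, not the actual preprocessing script:

```python
import re

def simplify_penman(penman_str: str) -> str:
    """Drop variable names ("w / ") from a PENMAN-notation AMR string,
    keeping parentheses, concepts, and edge labels.  A simplified
    stand-in for the actual preprocessing pipeline."""
    return re.sub(r"\b\w+\s*/\s*", "", penman_str)

amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
print(simplify_penman(amr))
# (want-01 :ARG0 (boy) :ARG1 (go-02 :ARG0 b))
```

Note that re-entrant variables (the second `b` above) are kept as bare tokens, which is one way linearizations preserve AMR reentrancies.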
Each instance of WebNLG contains a KG extracted from DBPedia (which is constructed by extracting structured content from Wikipedia 10.5555/1785162.1785216) in the form of triples and a target text, consisting of one or multiple sentences, that describes the graph. The test set is divided into three partitions: seen, which contains only DBPedia categories present in the training set; unseen, which covers categories never seen during training; and all, which includes categories from both the seen and unseen sets. Similarly to previous work harkous2020text; kale2020texttotext, we linearize the triples by adding special tokens. In particular, we prepend the tokens <H>, <R>, and <T> before the head entity, the relation, and the tail entity of a triple, respectively.
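The triple linearization can be sketched as follows (a minimal reconstruction, assuming <H>, <R>, and <T> are the special marker tokens):

```python
def linearize_graph(triples):
    """Linearize a list of (head, relation, tail) KG triples into a
    single string, marking each component with a special token."""
    parts = []
    for head, relation, tail in triples:
        parts += ["<H>", head, "<R>", relation, "<T>", tail]
    return " ".join(parts)

graph = [("Amarillo, Texas", "is Part Of", "United States")]
print(linearize_graph(graph))
# <H> Amarillo, Texas <R> is Part Of <T> United States
```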
The AGENDA dataset is constructed from the scientific AI domain: KGs are paired with scientific abstracts extracted from the proceedings of 12 top AI conferences. Each sample contains a paper title, a KG, and the corresponding paper abstract. A KG contains entities corresponding to scientific terms, which are often multi-word expressions, and edges that represent relations between the entities. This dataset has loose alignments between the input graph and the corresponding text, as the graphs were automatically generated by an information extraction system koncel-kedziorski-etal-2019-text. The input for the models is a text containing the title, a sequence of all KG entities, and the linearized triples; the target text is the paper abstract. We add special tokens to the triples in the same way as for WebNLG. In contrast to WebNLG, we have to list all entities given in the input in addition to listing all the triples, because AGENDA contains many instances with isolated entities, i.e., entities that do not occur in any triple; in WebNLG, this is never the case.
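A sketch of how such an input could be assembled; the exact separator conventions between title, entity list, and triples are our assumptions, not the paper's documented format:

```python
def build_agenda_input(title, entities, triples):
    """Concatenate the paper title, the full entity list, and the
    linearized triples.  Listing every entity separately also covers
    isolated nodes that occur in no triple."""
    entity_part = " ".join(entities)
    triple_part = " ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)
    return f"{title} {entity_part} {triple_part}".strip()

inp = build_agenda_input(
    "Graph-to-Text Generation",
    ["graph encoder", "LSTM", "beam search"],   # includes an isolated entity
    [("graph encoder", "COMPARE", "LSTM")],
)
print(inp)
```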
We develop our experiments based on the BART and T5 pretrained models released by wolf2019huggingfaces. We use Adam kingma:adam as the optimizer with an initial learning rate of . We do not employ warm-up in our learning rate schedule. For decoding, we use beam search with beam sizes of 5, 3, and 5 for AMR17, WebNLG, and AGENDA, respectively. Following previous work, we evaluate the results using the BLEU Papineni:2002:BMA:1073083.1073135, METEOR Denkowski14meteoruniversal, and CHRF++ popovic-2015-chrf automatic metrics. (For BLEU, we use the multi-BLEU script from the MOSES decoder suite Koehn:2007:MOS:1557769.1557821; for METEOR, we use the original meteor-1.5.jar script, https://github.com/cmu-mtlab/meteor.) We also evaluate the results using the MoverScore zhao-etal-2019-moverscore, BERTScore bert-score, and BLEURT sellam-etal-2020-bleurt metrics, as they employ contextual embeddings to incorporate semantic knowledge. We report test results for the checkpoint with the best BLEU score on the development set.
5.1 Results on AMR-to-Text
Table 1 shows the results of our experiments alongside several recent results reported on the AMR17 test set. Note that we also report results from models that leverage PLMs: mager2020gpttoo and harkous2020text employ GPT-2 in their approaches. However, whereas GPT-2 is a Transformer-based decoder, our approaches employ BART and T5, which follow a Transformer-based encoder-decoder architecture. T5-Large achieves a considerable improvement of 5.48 BLEU and 1.2 METEOR points over the previous state-of-the-art model that uses GPT-2. BART achieves a new state-of-the-art BLEU score of 43.72, 15.9% higher than the previous state of the art. Note that the semantic similarity metrics follow similar trends.
5.2 Results on WebNLG
We compare our models with several state-of-the-art results reported on this dataset. Table 3 shows the results for the WebNLG test set. Melbourne is one of the best competitors from the WebNLG challenge. moryossef-etal-2019-step and castro-ferreira-etal-2019-neural develop neural pipeline approaches with explicit intermediate steps in the generation process and achieve strong performance on the unseen partition. On the other hand, fully end-to-end models ribeiro2020modeling; schmitt2020modeling perform strongly on the seen partition but usually poorly on unseen data. A very recent model zhao-etal-2020-bridging achieves the best performance on the seen test set; however, it leverages additional information about the order in which the triples are realized in the target text. T5 achieves 59.29 and 54.69 BLEU points on the seen and unseen partitions, respectively. Note that kale2020texttotext also employ T5 on this benchmark; in contrast, our T5 setup adds a prefix before the input graph (see Section 3). BART performs much worse than T5 on this dataset, scoring around 10 BLEU points lower on the seen partition.
5.3 Results on AGENDA
In Table 4, we show the results for the AGENDA test set, comparing our approaches against several recent results reported on this dataset. It is worth mentioning that AGENDA graphs are incomplete with respect to the target text. In contrast to WebNLG, this dataset is from the scientific domain: the KGs and texts are extracted from AI papers koncel-kedziorski-etal-2019-text. The models also show strong performance on this dataset. We believe that their capacity to generate fluent text helps when generating paper abstracts, even though they were not pretrained on the scientific domain. We note that an intermediate task-adaptive pretraining step can enable the models to achieve better performance. Finally, note that BART performs impressively, with its large variant achieving 23.65 BLEU points, 5.6 points higher than the previous state of the art.
6 Task-adaptive Pretraining
Following recent work gururangan-etal-2020-dont, we investigate whether leveraging additional task-specific data can further improve performance on graph-to-text generation. Task-specific data refers to a smaller pretraining corpus that is more relevant to the task. To leverage the task-specific data, we add an intermediate pretraining step between the original pretraining and the fine-tuning phase for graph-to-text generation. Figure 2 shows the proposed training pipeline.
More precisely, we first continue pretraining BART and T5 using unsupervised or supervised training. In the supervised approach, we use pairs of graphs and corresponding texts collected from the same or a similar domain as the target task. In the unsupervised approach, we follow the BART and T5 pretraining strategies, corrupting the target text. Note that we do not use the graphs in the unsupervised pretraining, but only the target texts of our task-specific data collections. In particular, we randomly mask text spans, replacing 15% of the tokens (see lewis2019bart and 2019t5 for details about the unsupervised pretraining strategies). Finally, we fine-tune the models using the original training set.
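The span-corruption objective can be sketched roughly as follows; span lengths and the sentinel format are illustrative assumptions, and the actual BART and T5 noising functions differ in their details:

```python
import random

def mask_spans(tokens, mask_ratio=0.15, max_span=3, rng=None):
    """Replace random spans with sentinel tokens until roughly
    `mask_ratio` of the input tokens have been masked (T5-style
    corruption; BART instead learns to reconstruct the full text)."""
    rng = rng or random.Random(0)
    n = len(tokens)
    budget = max(1, round(n * mask_ratio))   # tokens still to mask
    out, i, sentinel = [], 0, 0
    while i < n:
        if budget > 0 and rng.random() < mask_ratio:
            # mask a span of 1..max_span tokens, capped by the budget
            span = min(rng.randint(1, max_span), budget, n - i)
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

corrupted = mask_spans("the model is pretrained on a large corpus".split(),
                       rng=random.Random(1))
print(" ".join(corrupted))
```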
We employ this strategy for two graph domains, meaning representations (AMR17) and scientific data (AGENDA), for which we collected additional task-specific data.
AMR Silver Data.
In order to generate additional data for AMR, we sample 200K Gigaword (https://catalog.ldc.upenn.edu/LDC2003T05) sentences and use JAMR (https://github.com/jflanigan/jamr) flanigan-etal-2016-cmu to parse them into AMR graphs. Note that the newswire data in Gigaword is also one of the data sources of the AMR17 benchmark. We use the 200K silver graphs to pretrain the models in a supervised way before fine-tuning them on gold AMR graphs. For the unsupervised pretraining, we only use the sentences.
Semantic Scholar AI Data.
We collect titles and abstracts of around 190K scientific papers from the Semantic Scholar corpus ammar-etal-2018-construction, taken from the proceedings of 36 top Computer Science/AI conferences. We construct KGs from the paper abstracts employing DyGIE++ wadden-etal-2019-entity, an automatic information extraction system. Note that the AGENDA dataset was constructed using the SciIE system luan-etal-2018-multi, which also extracts KGs from AI scientific papers. However, our new dataset covers a broader domain, as we collected data from 36 conferences compared to AGENDA's 12. Furthermore, to prevent data leakage, all AGENDA samples used for performance evaluation are removed from the dataset. We call this dataset KGAIA (KGs from AI Abstracts). Table 5 shows the dataset statistics.
Table 6 compares the models' performance with no additional pretraining, with additional unsupervised pretraining solely on the target texts of the respective in-domain corpus (TSU), and with additional task-adaptive supervised pretraining (TSS), before fine-tuning on the respective training sets. Unsupervised pretraining helps less than supervised pretraining but also brings gains in most cases. This suggests that the performance increases after additional supervised pretraining do not only come from seeing more task-specific target texts, but that the models learn how to handle graphs and the graph-to-text correspondence. Also note that we use the same number of data points for the unsupervised and supervised experiments. It is likely that unsupervised pretraining on larger amounts of in-domain data would improve the models even more.
When fine-tuned directly after the original pretraining, BART performs better than T5. However, when additionally pretrained on the task-specific data, T5 outperforms BART by 3.3 BLEU points, achieving 48.85 BLEU points, a new state of the art for AMR-to-text generation.
On AGENDA, BART benefits more from the task-adaptive pretraining, achieving a new state of the art of 25.66 BLEU points, a gain of 2 BLEU points. The improvements from task-adaptive pretraining are not as large as on the AMR dataset. We believe this is because the input graphs do not completely cover the target texts, making this dataset more challenging.
Table 7: BLEU scores by pretraining dataset (Pretrained on) and evaluation dataset (Evaluated on).
6.2 Cross-domain Pretraining
Table 8: Outputs of the T5 models fine-tuned on ordered (order) and shuffled (shuf) node labels for corrupted and correct WebNLG facts (T = true fact, F = false fact).

| # | True? | Input triple (head; relation; tail) | order output | shuf output |
| (1) | F | Ohio; is Part Of; Cleveland | Ohio is part of Cleveland. | Cleveland is part of Ohio. |
| (2) | F | United States; is Part Of; Amarillo, Texas | Amarillo, Texas is part of the United States. | Amarillo, Texas is part of the United States. |
| (3) | F | Leinster; is Part Of; Dublin | Leinster is part of Dublin. | Leinster is part of Dublin. |
| (4) | F | Rome; capital; Italy | Rome's capital is Italy. | Rome is the capital of Italy. |
| (5) | T | Italy; capital; Rome | Italy's capital is Rome. | Rome is the capital of Italy. |
| (6) | F | rome; capital; italy | The capital of rome is italy. | Italy is the capital of rome. |
| (7) | T | italy; capital; rome | Italy's capital is rome. | Italy's capital is rome. |
We also investigate how cross-domain pretraining affects performance after fine-tuning. Table 7 shows the results obtained with BART. While the target texts in KGAIA and AGENDA share the domain of scientific abstracts, texts in WebNLG are more general. The graphs in all three datasets are KGs, but WebNLG graphs do not share any relations with the other KGs. Nevertheless, supervised pretraining increases performance in the cross-domain setting in most cases. Only pretraining on the smaller WebNLG does not help for the larger AGENDA.
In general, the experiments suggest that exploring additional pretraining for graph-to-text generation tasks can improve the models’ performance even if the data do not come from the same domain. Therefore, we hope that our newly collected datasets will be valuable for future work on graph-to-text generation.
7 Graph Structure and Pretrained Models
7.1 Bag of Entities and Relations
To further examine to what extent the graph structure is used by the pretrained models, we conduct additional experiments where the graph structure is only encoded by the order of the node labels, i.e., we remove parentheses from AMRs and replace the <H>, <R>, and <T> tags with neutral separators for KGs. In this way, the graph structure is completely obscured if we shuffle the node labels. In particular, we investigate two model variants: a T5 model fine-tuned on ordered nodes (order) and another one fine-tuned on shuffled nodes (shuf). In Table 9, we compare the effects on the performance of the T5 models when the graph structure is lost during training or evaluation.
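The probe can be sketched as follows: the structural markers are replaced by a neutral separator, and shuffling the labels then removes all remaining structural information. This is a minimal reconstruction, not the exact preprocessing; the separator choice is an assumption:

```python
import random

def to_label_sequence(triples, shuffle=False, rng=None):
    """Flatten KG triples to their node/edge labels joined by a neutral
    separator; with shuffle=True the graph structure is fully obscured."""
    labels = [label for triple in triples for label in triple]
    if shuffle:
        (rng or random.Random(0)).shuffle(labels)
    return " | ".join(labels)

triples = [("Italy", "capital", "Rome")]
print(to_label_sequence(triples))                 # Italy | capital | Rome
print(to_label_sequence(triples, shuffle=True))   # same labels, random order
```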
We observe that the AMR-to-text performance drops significantly when the graph structure is missing. KG-to-text performance, however, does not decrease much, indicating that most of the PLMs’ success in this task stems from their language modeling rather than their graph encoding capabilities. It has recently been argued that large PLMs acquire a certain amount of world knowledge during pretraining (petroni-etal-2019-language). We hypothesize that this knowledge makes it easier to recover KG facts based on a set of entities and relations than reconstructing a corrupted AMR.
7.2 Qualitative Analysis
To further test our hypothesis that pretrained models make use of their world knowledge during KG-to-text generation, we take example facts from the WebNLG dev set, corrupt them, and feed them to the two T5 models introduced in the previous section. Table 8 shows the generations.
Note that order, the model trained on correctly ordered nodes inside the triples, has learned to rely more on the input graph structure. The false fact in example (1) is reliably transferred to the text by order but not by shuf, which silently corrects it. But even order is not completely free from its world knowledge bias, as illustrated in example (2), where both models refuse to generate an incorrect fact. This indicates that world knowledge is a strong guide during text generation even for models that were fine-tuned with the graph structure clearly marked. Interestingly, neither model corrects the wrong input in (3). The fact that Leinster is a region in Ireland, and not, e.g., a neighborhood of the city Dublin, is probably unknown to T5; it seems that T5 falls back to the order of words in the input in such cases. Examples (4)-(7) also illustrate this behavior nicely. While the well-known entities "Rome" and "Italy" produce behavior similar to "Ohio" and "Cleveland", i.e., order complies with generating a false statement and shuf rather follows its world knowledge, lowercasing the entity names changes that: for the unknown entities "rome" and "italy", both models fall back to the order of the input to produce their texts.
8 Conclusion
We have studied the problem of generating text from graphs, employing two PLMs: BART and T5. First, we show that these models achieve state-of-the-art results on different graph-to-text benchmarks even though they do not explicitly encode the graph structure. Second, we demonstrate that task-adaptive pretraining further improves the performance of PLMs for graph-to-text generation. Third, we examine to what extent the graph structure is considered during text generation, and our experiments suggest that prior world knowledge is a strong guide for these models. Future work should explore ways to inject the graph structure more deeply into large PLMs, possibly both leveraging their strong language modeling capabilities and keeping the output more faithful to the input.