Natural language generation from structured data, or data-to-text Kukich (1983); McKeown (1985), is the task of generating a textual description conditioned on source content provided in the form of structured data such as a table or graph. Examples of its applications include task oriented dialog Wen et al. (2015) and creating summaries from weather forecasts Sripada et al.
In this work we study the applicability of large scale transfer learning for this task. We use the term “pre-train + fine-tune” to refer to the paradigm of first pre-training a high capacity model on massive text corpora before fine-tuning it on a downstream task. Our study shows that this form of transfer learning, which is now ubiquitous in many areas of NLP Devlin et al. (2018), works well for text generation from structured data as well. In particular, we focus on pre-training in the form of the “Text-to-Text Transfer Transformer” (T5) models released by Raffel et al. (2019).
Fine-tuning T5 achieves state-of-the-art results on three diverse benchmarks spanning task oriented dialogue (MultiWoz), tables-to-text (ToTTo) and graph-to-text (WebNLG). Empirical results further suggest the following:
Transfer learning greatly improves robustness of models to out-of-domain inputs.
By leveraging pre-training, a single end-to-end model can outperform sophisticated, multi-stage pipelined approaches.
Our approach is simple, only scratching the surface of what is possible. There is much to be explored in the space of leveraging unlabelled data, developing unsupervised objectives etc. that are more tailored for generating text from structured data. We hope our work serves as a useful baseline for future research, as pre-training becomes ever more prevalent for this task.
2 Related Work
Transfer Learning Devlin et al. (2018), Howard and Ruder (2018) showed that unsupervised pre-training can greatly benefit tasks like text classification, question answering and summarization. In particular, Raffel et al. (2019) perform a large scale study of different training objectives, model capacity and size of data. Peng et al. (2020) and Chen et al. (2019b) show that pre-training in the form of GPT-2 can indeed improve performance on the data-to-text task as well. Our experiments show that pre-training with T5, where both encoder and decoder are trained using a span masking objective, performs significantly better than alternatives such as BERT, which is encoder-only, and GPT-2, which is decoder-only. Some works have also studied pre-training via supervised objectives, such as machine translation Siddhant et al. (2019); Kale and Roy (2020) and reading comprehension Khashabi et al. (2020).
Data-to-Text Early work on data-to-text focused on rule-based pipelined methods, while recent works have adopted neural approaches. Wen et al. (2015) proposed the Semantically Controlled LSTM and were among the first to show that neural networks can be successfully applied to this problem. Liu et al. (2018) generate text by conditioning language models on tables, Puduppully et al. (2019) explicitly model entities and Marcheggiani and Perez-Beltrachini (2018) encode structured data using graph convolutional networks. Ferreira et al. (2019) find that neural pipelined approaches perform better than end-to-end models. This notion is echoed by Moryossef et al. (2019), who show the effectiveness of adding an explicit planning stage prior to generation.
We rely on the T5 pre-trained models released by Raffel et al. (2019). They consist of a transformer based encoder-decoder architecture. These models were pre-trained in a multitask fashion with an unsupervised “span masking” objective on the C4 dataset as well as supervised translation, summarization, classification, and question answering tasks. Note that none of the supervised tasks include language generation from structured data. Disentangling the effects of unsupervised and supervised objectives is an interesting area for future work.
To study the impact of model capacity, we experiment with different T5 variants - Small (60 million parameters), Base (220 million), Large (770 million) and 3B (3 billion).
Our modeling approach is simple. The data-to-text task is cast in the text-to-text framework by representing the structured data as a flat string (linearization). Figure 1 shows examples of the input representation for each dataset.
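As a rough illustration of linearization, the sketch below flattens a set of graph triples and a dialog-act meaning representation into single strings. The separators, field order, function names (`linearize_triples`, `linearize_dialog_act`) and example values are assumptions made for this illustration; the input formats actually used for each dataset are those shown in Figure 1.

```python
from typing import Dict, List, Tuple


def linearize_triples(triples: List[Tuple[str, str, str]]) -> str:
    """Flatten graph triples into one flat string.

    The " | " and " && " separators are illustrative choices, not the paper's.
    """
    return " && ".join(" | ".join(triple) for triple in triples)


def linearize_dialog_act(act: str, slots: Dict[str, str]) -> str:
    """Flatten a dialog act and its slot key-value pairs into one flat string."""
    slot_text = " , ".join(f"{key} = {value}" for key, value in slots.items())
    return f"{act} ( {slot_text} )"


if __name__ == "__main__":
    # Made-up inputs, used only to show the shape of the flat strings.
    print(linearize_triples([
        ("Alan Bean", "occupation", "Test pilot"),
        ("Alan Bean", "nationality", "United States"),
    ]))
    print(linearize_dialog_act("inform", {"name": "Acorn Guest House", "area": "north"}))
```

The resulting string is what a text-to-text model would receive as its source sequence, with the reference description as the target.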
We then fine-tune T5 on the data-to-text corpus for a small number of steps. The maximum number of training steps is set to 5K for MultiWoz and WebNLG, while the larger ToTTo dataset is trained for 10K steps. All the model parameters are updated in the fine-tuning process.
5 Experimental Setup
The T5 vocabulary consists of 32,000 sentencepieces. Following Raffel et al. (2019), models are fine-tuned with a constant learning rate of 0.001.
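The fine-tuning settings stated so far can be collected in one place, as in the minimal sketch below. The `FineTuneConfig` class, its field names and `make_config` are hypothetical and only restate the values given in the text; they do not correspond to any particular training library.

```python
from dataclasses import dataclass

# Maximum fine-tuning steps per dataset, as stated in the text.
MAX_STEPS = {"MultiWoz": 5_000, "WebNLG": 5_000, "ToTTo": 10_000}


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the fine-tuning settings described in the text."""

    dataset: str
    max_steps: int
    learning_rate: float = 0.001        # constant learning rate
    vocab_size: int = 32_000            # number of sentencepieces
    update_all_parameters: bool = True  # no parameters are frozen


def make_config(dataset: str) -> FineTuneConfig:
    """Build the config for one of the three datasets; raises KeyError otherwise."""
    return FineTuneConfig(dataset=dataset, max_steps=MAX_STEPS[dataset])


if __name__ == "__main__":
    for name in MAX_STEPS:
        print(make_config(name))
```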
We conduct experiments on 3 English datasets spanning a variety of domains.
ToTTo Parikh et al. (2020) consists of Wikipedia tables paired with natural language descriptions. The input is a table with a subset of cells highlighted. A model must generate text that describes the highlighted cells. In this work, we use only the highlighted cells and metadata as input (as opposed to the full table).
MultiWoz Budzianowski et al. (2018) is a corpus of 10K human-human dialogs for developing task oriented dialogue systems. For the NLG task, a meaning representation encapsulating system actions must be verbalized into a natural language response. The meaning representation consists of dialog acts (inform, request etc.) and a list of slot key-value pairs.
WebNLG Gardent et al. (2017), where the task is to convert a graph of subject-predicate-object triples into a textual description.
7 Results and Discussion
The evaluation is done using BLEU and METEOR Lavie and Agarwal (2007), similar to Ferreira et al. (2019). The test set is split into two parts - seen and unseen. The examples in the unseen set are drawn from domains not present in the training set. The unseen set also features roughly 100 relations not seen during training.
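The idea behind the seen/unseen partition can be sketched as follows. The benchmark ships its own split, so this is only an illustration of the criterion; the `domain` field name and the `split_seen_unseen` helper are assumptions for this example.

```python
from typing import Dict, List, Tuple

Example = Dict[str, str]


def split_seen_unseen(
    train: List[Example], test: List[Example], key: str = "domain"
) -> Tuple[List[Example], List[Example]]:
    """Partition test examples by whether their domain occurs in the training set."""
    train_domains = {example[key] for example in train}
    seen = [example for example in test if example[key] in train_domains]
    unseen = [example for example in test if example[key] not in train_domains]
    return seen, unseen


if __name__ == "__main__":
    # Toy records with made-up domain labels.
    train = [{"domain": "Airport"}, {"domain": "Food"}]
    test = [{"domain": "Food"}, {"domain": "Athlete"}]
    seen, unseen = split_seen_unseen(train, test)
    print(len(seen), len(unseen))
```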
Some of the baselines we compare with are:
Melbourne, a neural encoder-decoder approach, which scored the highest in the automatic evaluation of the WebNLG challenge Gardent et al. (2017). The model relies on delexicalization, where entities are replaced with placeholders.
GTR-LSTM Distiawan et al. (2018), which employs a graph based triple encoder.
Step-by-Step Moryossef et al. (2019) which splits the generation procedure into a planning stage followed by a neural generation stage.
Pipeline-Transformer Ferreira et al. (2019), a pipelined neural system consisting of discourse ordering, text structuring, lexicalization and referring expression generation.
PlanEnc Zhao et al. (2020), the current state-of-the-art system. It consists of a graph convolution network based planning model which first predicts the order of the triples. This is followed by an LSTM with attention and copy mechanism to generate the text. To train the planning model, the approach relies on extra annotations for the triple ordering. Such annotations can be expensive and time consuming to obtain, especially for large, complex inputs.
Results are reported in Table 2, for the overall test set as well as the seen and unseen splits. T5-Large performs the best across BLEU as well as METEOR. It improves over PlanEnc by 4.3 BLEU on the overall test set. It also displays excellent generalization to new domains and relations, with a 14 BLEU improvement on the unseen test set. The results indicate that with pre-training, end-to-end neural models can surpass sophisticated pipelined approaches.
All the T5 models perform well on the Seen test set. On the Unseen test set, T5-Small scores substantially lower, indicating that pre-training with large capacity models is required for out-of-domain generalization.
Following Parikh et al. (2020), BLEU and PARENT are employed as evaluation metrics for this table-to-text generation task. PARENT is a word-overlap based metric that reflects the factual accuracy of generated text relative to both the structured data and the reference. Dhingra et al. (2019) find that PARENT correlates better with human factual accuracy judgements in comparison to other generation metrics like ROUGE Lin (2004) and METEOR.
The following baseline models are compared:
Content Planner Puduppully et al. (2019) - A seq2seq model with separate content planning and generation stages.
Pointer Generator See et al. (2017) - An LSTM based seq2seq model with attention and pointer network based copy mechanism.
BERT-to-BERT Rothe et al. (2019) - A transformer based encoder-decoder model, where both the encoder and decoder are initialized with BERT.
Notably, ToTTo features a hidden test set, which is split into two halves - Overlap and Non-Overlap. The Non-Overlap test set features examples that are out-of-domain. A submission must be made to the leaderboard in order to get the metrics on the test sets.
Results are reported in Table 3. Our only submission (based on T5-3B, decoded with beam search of width 10 for the test set submission) achieves state-of-the-art results, improving upon the BERT based baseline by 5.5 BLEU and 5.8 PARENT. Moreover, the model is more robust to out-of-domain tables, with larger improvements of 6.6 BLEU and 7.5 PARENT on the Non-Overlap test set.
Table 4 reports results on the development set for the different T5 model sizes. T5-Base, which has roughly the same number of parameters as BERT-to-BERT, shows large improvements (+3.7 BLEU, +4.5 PARENT). Even T5-Small, which has 3x fewer parameters, performs better than BERT-to-BERT.
|Model||Overall||Overlap Subset||Non-Overlap Subset|
Evaluation on MultiWoz is done using BLEU and SER (Slot Error Rate). SER is the fraction of examples where at least one slot value from the structured data is not expressed in the predicted response. The metric is noisy since the comparison is done via exact match and does not cover all slots.
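A minimal sketch of SER as defined above is given below. The function name `slot_error_rate` and the case-insensitive substring test used to approximate exact match are assumptions for this illustration, not the evaluation script used for the reported numbers.

```python
from typing import List


def slot_error_rate(slot_values: List[List[str]], responses: List[str]) -> float:
    """Fraction of examples where at least one slot value is missing from the response.

    slot_values[i] holds the slot values of example i; responses[i] is the
    predicted response. A value counts as expressed only if it appears verbatim
    (ignoring case), so paraphrases of a value are counted as errors.
    """
    if len(slot_values) != len(responses):
        raise ValueError("slot_values and responses must have the same length")
    if not responses:
        return 0.0
    errors = sum(
        any(value.lower() not in response.lower() for value in values)
        for values, response in zip(slot_values, responses)
    )
    return errors / len(responses)


if __name__ == "__main__":
    # Made-up examples: the second response omits the value "italian".
    values = [["north"], ["cheap", "italian"]]
    predictions = ["It is in the north.", "It is a cheap place."]
    print(slot_error_rate(values, predictions))
```

Because matching is verbatim, a response that expresses a slot with different wording is flagged as an error, which is one source of the noise mentioned above.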
We compare with the following baselines:
HDSA Chen et al. (2019a) - Hierarchically Disentangled Self-Attention, a transformer based architecture that encodes the dialog acts into a multi-layer hierarchical graph, with disentangled attention heads modeling specific nodes in the dialog act graph.
SC-GPT Peng et al. (2020) - A GPT-2 (345M parameters) model that is further pre-trained on a large data-to-text dialog corpus (roughly 400,000 examples) and finally fine-tuned on MultiWoz. This two-stage pre-training approach is currently state-of-the-art for MultiWoz.
Results are reported in Table 5. All T5 based models (including T5-Small, which has 5x fewer parameters) outperform SC-GPT by 4-5 BLEU without any in-domain pre-training. While the SER scores are slightly worse, upon manual inspection we found that the difference can largely be attributed to false positives arising from annotation inconsistencies in the dataset combined with the exact match constraint, which does not account for paraphrases.
8 Conclusion and Future Work
In this study we evaluated pre-training in the form of T5 for the data-to-text task. We found that it leads to state-of-the-art results, while greatly improving robustness to out-of-domain inputs. Though we focused on automatic metrics, corroborating our findings via human evaluation is an important next step. In the future, we also hope to design unsupervised pre-training objectives that are specifically tailored for the data-to-text task.
- MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026.
- Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3696–3709.
- Few-shot NLG with pre-trained language model. arXiv preprint arXiv:1904.09521.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4884–4895.
- GTR-LSTM: a triple encoder for sentence generation from RDF data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1627–1637.
- Neural data-to-text generation: a comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 552–562.
- The WebNLG challenge: generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pp. 124–133.
- Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339.
- Machine translation pre-training for data-to-text generation - a case study in Czech. arXiv preprint arXiv:2004.02077.
- UnifiedQA: crossing format boundaries with a single QA system. arXiv preprint arXiv:2005.00700.
- Design of a knowledge-based report generator. In Proceedings of the 21st Annual Meeting on Association for Computational Linguistics, pp. 145–150.
- METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 228–231.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
- Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Deep graph convolutional encoders for structured data to text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pp. 1–9.
- Text generation: using discourse strategies and focus constraints to generate natural language text.
- Step-by-step: separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2267–2277.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
- ToTTo: a controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
- Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
- A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
- Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6908–6915.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.
- Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083.
- Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437.
- SumTime-Mousam: configurable marine weather forecast generator.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
- Bridging the structural gap between encoding and decoding for data-to-text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).