Text-to-Text Pre-Training for Data-to-Text Tasks

05/21/2020 ∙ by Mihir Kale, et al. ∙ Google

We study the pre-train + fine-tune strategy for data-to-text tasks. Fine-tuning T5 achieves state-of-the-art results on the WebNLG, MultiWoz and ToTTo benchmarks. Moreover, the models are fully end-to-end and do not rely on any intermediate planning steps, delexicalization or copy mechanisms. T5 pre-training also enables stronger generalization, as evidenced by large improvements on out-of-domain test sets. We hope our work serves as a useful baseline for future research, as pre-training becomes ever more prevalent for data-to-text tasks.




1 Introduction

Natural language generation from structured data, or data-to-text Kukich (1983); McKeown (1985), is the task of generating a textual description conditioned on source content provided in the form of structured data such as a table or graph. Example applications include task oriented dialog Wen et al. (2015) and creating summaries from weather forecasts Sripada et al.

In this work we study the applicability of large scale transfer learning for this task. We use the term “pre-train + fine-tune” to refer to the paradigm of first pre-training a high capacity model on massive text corpora before fine-tuning it on a downstream task. Our study shows that this form of transfer learning, which is now ubiquitous in many areas of NLP Devlin et al. (2018), works well for text generation from structured data as well. In particular, we focus on pre-training in the form of the “Text-to-Text Transfer Transformer” (T5) models released by Raffel et al. (2019).

Fine-tuning T5 achieves state-of-the-art results on three diverse benchmarks spanning task oriented dialogue (MultiWoz), tables-to-text (ToTTo) and graph-to-text (WebNLG). Empirical results further suggest the following:

  • Transfer learning greatly improves robustness of models to out-of-domain inputs.

  • T5 outperforms alternatives like BERT Devlin et al. (2018) and GPT-2 Radford et al. (2019).

  • By leveraging pre-training, a single end-to-end model can outperform sophisticated, multi-stage pipelined approaches.

  • With the addition of pre-training, simple transformer Vaswani et al. (2017) models exceed the performance of more exotic architectures (e.g. pointer networks, graph neural networks) specifically tailored for data-to-text generation.

Our approach is simple, only scratching the surface of what is possible. There is much to be explored in the space of leveraging unlabelled data, developing unsupervised objectives etc. that are more tailored for generating text from structured data. We hope our work serves as a useful baseline for future research, as pre-training becomes ever more prevalent for this task.

2 Related Work

Transfer Learning. Devlin et al. (2018) and Howard and Ruder (2018) showed that unsupervised pre-training can greatly benefit tasks like text classification, question answering, summarization etc. In particular, Raffel et al. (2019) perform a large scale study of different training objectives, model capacity and size of data. Peng et al. (2020) and Chen et al. (2019b) show that pre-training in the form of GPT-2 can indeed improve performance on data-to-text tasks as well. Our experiments show that pre-training with T5, where both encoder and decoder are trained using a span masking objective, performs significantly better than alternatives such as the encoder-only BERT and the decoder-only GPT-2. Some works have also studied pre-training via supervised objectives, such as machine translation Siddhant et al. (2019); Kale and Roy (2020) and reading comprehension Khashabi et al. (2020).

Data-to-Text. Early work on data-to-text focused on rule-based pipelined methods, while recent works have adopted neural approaches. Wen et al. (2015) proposed the Semantically Controlled LSTM and were one of the first to show that neural networks can be successfully applied to this problem. Liu et al. (2018) generate text by conditioning language models on tables, Puduppully et al. (2019) explicitly model entities and Marcheggiani and Perez-Beltrachini (2018) encode structured data using graph convolutional networks. Ferreira et al. (2019) find that neural pipelined approaches perform better than end-to-end models. This notion is echoed by Moryossef et al. (2019), who show the effectiveness of adding an explicit planning stage prior to generation.

Figure 1: Examples from each dataset. The first row is WebNLG, the second is MultiWoz and the third is ToTTo. Each row illustrates the structured data (left), its linearized representation (top) and the target text (bottom).

3 Pre-training

We rely on the T5 pre-trained models released by Raffel et al. (2019). They consist of a transformer based encoder-decoder architecture. These models were pre-trained in a multitask fashion with an unsupervised “span masking” objective on the C4 dataset as well as supervised translation, summarization, classification, and question answering tasks. Note that none of the supervised tasks include language generation from structured data. Disentangling the effects of unsupervised and supervised objectives is an interesting area for future work.

To study the impact of model capacity, we experiment with different T5 variants - Small (60 million parameters), Base (220 million), Large (770 million) and 3B (3 billion).

4 Fine-tuning

Our modeling approach is simple. The data-to-text task is cast in the text-to-text framework by representing the structured data as a flat string (linearization). Figure 1 shows examples of the input representation for each dataset.
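
As an illustration of linearization, the sketch below flattens a set of WebNLG-style triples and a MultiWoz-style dialog act into single strings. The separator tokens (`<S>`, `<P>`, `<O>`), the function names and the example values are assumptions made for illustration only; the string format actually used in the experiments is the one shown in Figure 1.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def linearize_triples(triples: List[Triple]) -> str:
    """Flatten a graph given as triples into one string.

    The <S>/<P>/<O> markers are hypothetical; any consistent scheme works.
    """
    parts = []
    for subj, pred, obj in triples:
        parts.append(f"<S> {subj} <P> {pred} <O> {obj}")
    return " ".join(parts)


def linearize_dialog_act(act: str, slots: Dict[str, str]) -> str:
    """Flatten a dialog act and its slot key-value pairs into one string."""
    slot_text = ", ".join(f"{key} = {value}" for key, value in slots.items())
    return f"{act} ( {slot_text} )"


if __name__ == "__main__":
    # Made-up example inputs, not taken from the datasets.
    graph = [
        ("Alan Bean", "occupation", "test pilot"),
        ("Alan Bean", "birthPlace", "Wheeler, Texas"),
    ]
    print(linearize_triples(graph))
    print(linearize_dialog_act("inform", {"food": "italian", "area": "centre"}))
```

The resulting string is what the encoder sees as input; the target text is left unchanged.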

We then fine-tune T5 on the data-to-text corpus for a small number of steps. The maximum number of training steps is set to 5K for MultiWoz and WebNLG, while the larger ToTTo dataset is trained for 10K steps. All the model parameters are updated in the fine-tuning process.

5 Experimental Setup

The T5 vocabulary consists of 32,000 sentencepieces. Following Raffel et al. (2019), models are fine-tuned with a constant learning rate of 0.001.
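
The settings stated in Sections 3 to 5 can be collected in a small configuration object. This is a sketch only: the class, field and function names are hypothetical and do not correspond to any T5 library API; only the numeric values come from the text.

```python
from dataclasses import dataclass

# Maximum fine-tuning steps per dataset, as stated in Section 4.
MAX_STEPS = {"MultiWoz": 5_000, "WebNLG": 5_000, "ToTTo": 10_000}

# Approximate parameter counts of the T5 variants, as stated in Section 3.
T5_PARAMS = {
    "Small": 60_000_000,
    "Base": 220_000_000,
    "Large": 770_000_000,
    "3B": 3_000_000_000,
}


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the fine-tuning settings in the text."""

    dataset: str
    t5_variant: str
    max_steps: int
    learning_rate: float = 0.001        # constant learning rate
    vocab_size: int = 32_000            # sentencepieces
    update_all_parameters: bool = True  # no parameters are frozen


def make_config(dataset: str, t5_variant: str) -> FineTuneConfig:
    """Build a config, rejecting datasets or variants not used in the paper."""
    if dataset not in MAX_STEPS:
        raise ValueError(f"unknown dataset: {dataset}")
    if t5_variant not in T5_PARAMS:
        raise ValueError(f"unknown T5 variant: {t5_variant}")
    return FineTuneConfig(dataset, t5_variant, MAX_STEPS[dataset])


if __name__ == "__main__":
    print(make_config("ToTTo", "Base"))
```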

The best checkpoint is chosen based on the BLEU score on the development set. Decoding is done via greedy search. For model development, we compute BLEU Papineni et al. (2002) scores using sacrebleu Post (2018). In the final evaluation, for each dataset we rely on metrics used by prior work.
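
Checkpoint selection by development-set BLEU amounts to taking an argmax over the saved checkpoints. In the sketch below, `select_best_checkpoint` is a hypothetical helper and the step numbers and scores are made-up placeholders; in practice the scores would come from a BLEU implementation such as sacrebleu.

```python
from typing import Dict


def select_best_checkpoint(dev_bleu_by_step: Dict[int, float]) -> int:
    """Return the training step whose checkpoint has the highest dev BLEU.

    Ties are broken in favor of the earlier step.
    """
    if not dev_bleu_by_step:
        raise ValueError("no checkpoints were evaluated")
    return min(dev_bleu_by_step, key=lambda step: (-dev_bleu_by_step[step], step))


if __name__ == "__main__":
    # Placeholder scores for illustration only, not results from the paper.
    scores = {1000: 40.2, 2000: 43.7, 3000: 43.7, 4000: 42.9}
    print(select_best_checkpoint(scores))
```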

6 Datasets

We conduct experiments on 3 English datasets spanning a variety of domains.

  • ToTTo Parikh et al. (2020) consists of Wikipedia tables paired with natural language descriptions. The input is a table with a subset of cells highlighted. A model must generate text that describes the highlighted cells. In this work, we use only the highlighted cells and metadata as input (as opposed to the full table).

  • MultiWoz Budzianowski et al. (2018) is a corpus of 10K human-human dialogs for developing task oriented dialogue systems. For the NLG task, a meaning representation encapsulating system actions must be verbalized into a natural language response. The meaning representation consists of dialog acts (inform, request etc.) and a list of slot key-value pairs.

  • WebNLG Gardent et al. (2017), where the task is to convert a graph of subject-object-predicate triples into a textual description.

Each dataset uses a different kind of structured data (tables, meaning representations and graph/triples). Table 1 lists the sizes of the three datasets and Figure 1 shows examples for each.

Dataset    Train    Dev     Test
WebNLG     18.1K    2268    4928
ToTTo      120K     7700    7700
MultiWoz   56.8K    7374    7372
Table 1: Dataset sizes.

7 Results and Discussion

Model          BLEU                       METEOR
               Overall  Seen   Unseen     Overall  Seen   Unseen
Melbourne      45.1     54.5   33.3       0.37     0.41   0.33
GTR-LSTM       37.1     54.0   29.2       0.31     0.37   0.28
Transformer    51.7     56.4   38.9       0.32     0.41   0.21
Step-by-Step   47.4     53.3   34.4       0.39     0.44   0.34
PlanEnc        52.8     64.4   38.2       0.41     0.45   0.37
T5-Small       52.0     62.6   38.8       0.41     0.45   0.37
T5-Base        55.2     64.7   49.4       0.43     0.46   0.41
T5-Large       57.1     63.9   52.8       0.44     0.46   0.41
T5-3B          54.0     62.8   52.0       0.43     0.45   0.42
Table 2: Results on WebNLG. Metrics as reported in Zhao et al. (2020).

7.1 WebNLG

The evaluation is done using BLEU and METEOR Lavie and Agarwal (2007), similar to Ferreira et al. (2019). The test set is split into two parts - seen and unseen. The examples in the unseen set are drawn from domains not present in the training set. It also features roughly 100 relations not seen during training.
Some of the baselines we compare with are:

  • Melbourne, a neural encoder-decoder approach, which scored the highest in the automatic evaluation of the WebNLG challenge Gardent et al. (2017). The model relies on delexicalization, where entities are replaced with placeholders.

  • GTR-LSTM Distiawan et al. (2018), which employs a graph based triple encoder.

  • Step-by-Step Moryossef et al. (2019), which splits the generation procedure into a planning stage followed by a neural generation stage.

  • Pipeline-Transformer Ferreira et al. (2019), a pipelined neural system consisting of discourse ordering, text structuring, lexicalization and referring expression generation.

  • PlanEnc Zhao et al. (2020), the current state-of-the-art system. It consists of a graph convolution network based planning model which first predicts the order of the triples. This is followed by an LSTM with attention and copy mechanism to generate the text. To train the planning model, the approach relies on extra annotations for the triple ordering. Such annotations can be expensive and time consuming to obtain, especially for large, complex inputs.

Results are reported in Table 2, for the overall test set as well as the seen and unseen splits. T5-Large performs the best across BLEU as well as METEOR. It improves over PlanEnc by 4.3 BLEU on the overall test set. It also displays excellent generalization to new domains and relations, with a 14 BLEU improvement on the unseen test set. The results indicate that with pre-training, end-to-end neural models can surpass sophisticated pipelined approaches.

All the T5 models perform well on the Seen test set. On the Unseen test set, T5-Small scores substantially lower, indicating that pre-training with large capacity models is required for out-of-domain generalization.
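
The effect of model size on out-of-domain generalization can be read off Table 2 by computing the gap between Seen and Unseen BLEU for each model. The snippet below is a small worked calculation over the Table 2 values; the variable and function names are illustrative.

```python
# BLEU scores from Table 2: (seen, unseen) per model.
WEBNLG_BLEU = {
    "PlanEnc": (64.4, 38.2),
    "T5-Small": (62.6, 38.8),
    "T5-Base": (64.7, 49.4),
    "T5-Large": (63.9, 52.8),
    "T5-3B": (62.8, 52.0),
}


def seen_unseen_gap(model: str) -> float:
    """Seen BLEU minus Unseen BLEU, rounded to one decimal place."""
    seen, unseen = WEBNLG_BLEU[model]
    return round(seen - unseen, 1)


if __name__ == "__main__":
    for name in WEBNLG_BLEU:
        print(f"{name}: gap = {seen_unseen_gap(name)}")
```

With these numbers the gap shrinks from 23.8 BLEU for T5-Small to 15.3 for T5-Base and to about 11 for the two largest models, while Seen BLEU stays within roughly two points across all T5 sizes.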

Model               Overall           Overlap Subset    Non-Overlap Subset
                    BLEU    PARENT    BLEU    PARENT    BLEU    PARENT
Content Planner     19.2    29.2      24.5    32.5      13.9    25.8
Pointer-Generator   41.6    51.6      50.6    58.0      32.2    45.2
BERT-to-BERT        44.0    52.6      52.7    58.4      34.8    46.7
T5-3B               49.5    58.4      57.5    62.6      41.4    54.2
Table 3: Results on the ToTTo test set.

7.2 ToTTo

Following Parikh et al. (2020), BLEU and PARENT are employed as evaluation metrics for this table-to-text generation task. PARENT is a word-overlap based metric that reflects the factual accuracy of generated text relative to the structured data. Dhingra et al. (2019) find that PARENT correlates better with human factual accuracy judgements in comparison to other generation metrics like ROUGE Lin (2004) and METEOR.
The following baseline models are compared:

  • Content Planner Puduppully et al. (2019) - A seq2seq model with separate content planning and generation stages.

  • Pointer Generator See et al. (2017) - An LSTM based seq2seq model with attention and pointer network based copy mechanism.

  • BERT-to-BERT Rothe et al. (2019) - A transformer based encoder-decoder model, where both the encoder and decoder are initialized with BERT.

Notably, ToTTo features a hidden test set, which is split into two halves - Overlap and Non-Overlap. The Non-Overlap test set features examples that are out-of-domain. A submission must be made to the leaderboard in order to get the metrics on the test sets.

Results are reported in Table 3. Our only submission, based on T5-3B (we used beam search with a width of 10 for the test set submission), achieves state-of-the-art results, improving upon the BERT based baseline by 5.5 BLEU and 5.8 PARENT. Moreover, the model is more robust to out-of-domain tables, with larger improvements of 6.6 BLEU and 7.5 PARENT on the Non-Overlap test set.

Table 4 reports results on the development set for the different T5 model sizes. T5-Base, which has roughly the same number of parameters as BERT-to-BERT, shows large improvements (+3.7 BLEU, +4.5 PARENT). Even T5-Small, which has 3x fewer parameters, performs better than BERT-to-BERT.

Model          Overall           Overlap Subset    Non-Overlap Subset
               BLEU    PARENT    BLEU    PARENT    BLEU    PARENT
BERT-to-BERT   44.0    52.6      52.7    58.4      34.8    46.7
T5-Small       45.7    55.9      53.9    60.4      37.7    51.6
T5-Base        47.7    57.1      56.1    61.8      39.6    52.6
T5-Large       48.1    57.3      56.8    62.0      39.8    52.8
T5-3B          48.4    57.8      56.7    62.4      40.4    53.3
Table 4: Results on the ToTTo development set for different variants of T5.

7.3 MultiWoz

Evaluation on MultiWoz is done using BLEU and SER (Slot Error Rate). SER is the fraction of examples where at least one slot value from the structured data is not expressed in the predicted response. The metric is noisy since the comparison is done via exact match and does not cover all slots.
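
The SER definition above can be sketched as follows. This is a minimal illustration assuming case-insensitive exact substring matching; the function names and example data are hypothetical, and the actual evaluation script may differ in which slots it checks and how it normalizes text.

```python
from typing import Dict, List


def has_slot_error(slots: Dict[str, str], response: str) -> bool:
    """True if at least one slot value does not appear verbatim in the response."""
    text = response.lower()
    return any(value.lower() not in text for value in slots.values())


def slot_error_rate(examples: List[Dict[str, str]], responses: List[str]) -> float:
    """Fraction of examples with at least one unexpressed slot value."""
    if len(examples) != len(responses):
        raise ValueError("examples and responses must have the same length")
    if not examples:
        return 0.0
    errors = sum(has_slot_error(s, r) for s, r in zip(examples, responses))
    return errors / len(examples)


if __name__ == "__main__":
    # Made-up examples, not taken from MultiWoz.
    slots = [
        {"food": "italian", "area": "centre"},
        {"price": "cheap"},
    ]
    outputs = [
        "There is an Italian restaurant in the centre.",
        "It is an inexpensive place.",  # paraphrase, counted as an error
    ]
    print(slot_error_rate(slots, outputs))
```

The second example shows how a valid paraphrase is counted as an error under exact matching, which is one source of the noise noted above.
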
We compare with the following baselines:

  • HDSA Chen et al. (2019a) - Hierarchically Disentangled Self-Attention, a transformer based architecture that encodes the dialog acts into a multi-layer hierarchical graph, with disentangled attention heads modeling specific nodes in the dialog act graph.

  • SC-GPT Peng et al. (2020) - A GPT-2 (345M parameters) model that is further pre-trained on a large data-to-text dialog corpus (roughly 400,000 examples) and finally fine-tuned on MultiWoz. This two-stage pre-training approach is currently state-of-the-art for MultiWoz.

Results are reported in Table 5. All T5 based models (including T5-Small, which has 5x fewer parameters) outperform SC-GPT by roughly 4 BLEU without any in-domain pre-training. While the SER scores are slightly worse, upon manual inspection we found that the difference can largely be attributed to false positives arising from annotation inconsistencies in the dataset combined with the exact match constraint, which does not account for paraphrases.

Model      BLEU    SER
HDSA       26.5    12.14
SC-GPT2    30.8    0.53
T5-Small   34.6    1.27
T5-Base    35.1    0.99
T5-Large   34.7    0.92
T5-3B      34.8    0.86
Table 5: Results on MultiWoz.

8 Conclusion and Future Work

In this study we evaluated pre-training in the form of T5 for the data-to-text task. We found that it leads to state-of-the-art results, while greatly improving robustness to out-of-domain inputs. Though we focused on automatic metrics, corroborating our findings via human evaluation is an important next step. In the future, we also hope to design unsupervised pre-training objectives that are specifically tailored for the data-to-text task.


References

  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gasic (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026. Cited by: 2nd item.
  • W. Chen, J. Chen, P. Qin, X. Yan, and W. Y. Wang (2019a) Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3696–3709. Cited by: 1st item.
  • Z. Chen, H. Eavani, Y. Liu, and W. Y. Wang (2019b) Few-shot nlg with pre-trained language model. arXiv preprint arXiv:1904.09521. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 2nd item, §1, §2.
  • B. Dhingra, M. Faruqui, A. Parikh, M. Chang, D. Das, and W. Cohen (2019) Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4884–4895. Cited by: §7.2.
  • B. Distiawan, J. Qi, R. Zhang, and W. Wang (2018) GTR-lstm: a triple encoder for sentence generation from rdf data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1627–1637. Cited by: 2nd item.
  • T. C. Ferreira, C. van der Lee, E. van Miltenburg, and E. Krahmer (2019) Neural data-to-text generation: a comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 552–562. Cited by: §2, 4th item, §7.1.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017) The webnlg challenge: generating text from rdf data. In Proceedings of the 10th International Conference on Natural Language Generation, pp. 124–133. Cited by: 3rd item, 1st item.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Cited by: §2.
  • M. Kale and S. Roy (2020) Machine translation pre-training for data-to-text generation–a case study in czech. arXiv preprint arXiv:2004.02077. Cited by: §2.
  • D. Khashabi, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi (2020) UnifiedQA: crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700. Cited by: §2.
  • K. Kukich (1983) Design of a knowledge-based report generator. In Proceedings of the 21st annual meeting on Association for Computational Linguistics, pp. 145–150. Cited by: §1.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 228–231. Cited by: §7.1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §7.2.
  • T. Liu, K. Wang, L. Sha, B. Chang, and Z. Sui (2018) Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • D. Marcheggiani and L. Perez-Beltrachini (2018) Deep graph convolutional encoders for structured data to text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pp. 1–9. Cited by: §2.
  • K. R. McKeown (1985) Text generation: using discourse strategies and focus constraints to generate natural language text. Cited by: §1.
  • A. Moryossef, Y. Goldberg, and I. Dagan (2019) Step-by-step: separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2267–2277. Cited by: §2, 3rd item.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §5.
  • A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das (2020) ToTTo: a controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373. Cited by: 1st item, §7.2.
  • B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao (2020) Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328. Cited by: §2, 2nd item.
  • M. Post (2018) A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771. Cited by: §5.
  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6908–6915. Cited by: §2, 1st item.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: 2nd item.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: Text-to-Text Pre-Training for Data-to-Text Tasks, §1, §2, §3, §5.
  • S. Rothe, S. Narayan, and A. Severyn (2019) Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461. Cited by: 3rd item.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: 2nd item.
  • A. Siddhant, M. Johnson, H. Tsai, N. Arivazhagan, J. Riesa, A. Bapna, O. Firat, and K. Raman (2019) Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437. Cited by: §2.
  • S. Sripada, E. Reiter, and I. Davy SumTime-mousam: configurable marine weather forecast generator. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: 4th item.
  • T. Wen, M. Gasic, N. Mrksic, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745. Cited by: §1, §2.
  • C. Zhao, M. Walker, and S. Chaturvedi (2020) Bridging the structural gap between encoding and decoding for data-to-text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: 5th item, Table 2.