Data-to-text (D2T) generation is the task of generating texts from structured inputs (Reiter and Dale, 1997). Previous approaches to this task can be classified according to whether or not they adopt separate stages. For example, one approach first generates the structured inputs in the correct order and then realizes the whole sentence (Puduppully et al., 2019; Wang et al., 2021; Su et al., 2021). Another generates the whole sentence in an end-to-end (E2E) manner, using a copy mechanism (Gu et al., 2016; See et al., 2017) or only a pre-trained model (Kale and Rastogi, 2020).
Each of the two approaches has advantages and disadvantages. Methods with separate stages generate more confident texts than E2E models, since the first stage explicitly produces the entities from the structured inputs. However, the output sentence can be awkward, or the overall performance may be worse than that of E2E models. This occurs because the output of the first stage differs from the gold label that the second stage expects. If the first stage produces a slightly incorrect result, the second stage takes it as input and amplifies the error in the subsequent sentences. This is often referred to as error propagation. E2E models are free from this vulnerability, but they can omit important entities from the structured inputs. To include these entities, a separate module (e.g., a copy mechanism) may be added to E2E models, but this can produce awkward sentences and degrade system integrity. For this reason, we consider using a pre-trained model without additional modules. A pre-trained E2E Transformer model (e.g., T5 (Raffel et al., 2019)) shows competitive performance on D2T tasks (Kale and Rastogi, 2020).
Omitting important entities from structured inputs is related to recall values. In D2T, the recall value is a metric that considers not only the target sentence, but also the structured inputs. A high recall indicates that more structured inputs are included. This metric is described in detail in Section 2.
When the same target sentence is repeated, with the two copies divided by a special token (i.e., "target_1 <SEP> target_2"), we make two observations. First, a Transformer-based (T5) model generates asymmetric sentences (Figure 1); i.e., the first part of the output, which corresponds to target_1, is longer than the second part, which corresponds to target_2. Second, the first part of the output covers the structured inputs better than the second part. We call this phenomenon Asymmetric Generation.
Asymmetric Generation can be exploited to improve the recall mentioned earlier. In our experiments on the ToTTo corpus (Parikh et al., 2020) and WIKITABLET (Chen et al., 2020), the first part of the asymmetric output shows a higher recall than the second part. Its recall is even higher than that of the output of a model trained on no-repeated-targets.
We concatenate the first part with a no-repeated-target ("the first part <SEP> no-repeated-target"), then train a new model with the lengthened targets (Figure 2). This process can be conducted repeatedly. We call this process Progressive Edit (ProEdit) [1]. Our experimental results on the ToTTo corpus demonstrate the benefit of ProEdit, which achieves a new state-of-the-art on the PARENT (Dhingra et al., 2019) metric.
[1] The name of our model comes from ProGen (Tan et al., 2020) because it progressively edits the initial output.
|ToTTo (input=418.43)||WIKITABLET (input=412.34)|
|T5-large||BLEU||P||R||F1 [2]||Length (ref=86.4)||BLEU||P||R||F1 [3]||Length (ref=627)|
[2] We used the official scripts: https://github.com/google-research-datasets/ToTTo
[3] We used the official scripts: https://github.com/mingdachen/WikiTableT
2 Related Work
PARENT Metric. The PARENT metric (Dhingra et al., 2019) was introduced to automatically evaluate texts generated from structured inputs.
Its precision for n-grams of order $n$, denoted by $E_p^n$, is given by:

\[
E_p^n = \frac{\sum_{g \in G_n^i} \left[ Pr(g \in R_n^i) + Pr(g \notin R_n^i)\, w(g) \right] \#_{G_n^i}(g)}{\sum_{g \in G_n^i} \#_{G_n^i}(g)} \tag{1}
\]

where $R_n^i$ and $G_n^i$ denote the collections of n-grams of order $n$ in $R^i$ and $G^i$, which are the $i$-th target and generated text, respectively. $Pr(g \in R_n^i)$ is given by $\#_{G_n^i, R_n^i}(g) / \#_{G_n^i}(g)$, where $\#_{G_n^i}(g)$ is the count of n-gram $g$ in $G_n^i$ and $\#_{G_n^i, R_n^i}(g)$ denotes the minimum of its counts in $G_n^i$ and $R_n^i$. The entailment probability, denoted as $w(g)$, is the most important term: it checks whether the presence of an n-gram $g$ in a generated text is "correct" given the structured inputs. Two models have been introduced to calculate $w(g)$: the word overlap model and the co-occurrence model. The word overlap model is used in most cases, so we use it as well.
The recall of PARENT is computed against both the target $R^i$ and the table $T^i$. The two are combined by a weighted geometric average:

\[
E_r = E_r(R^i)^{(1-\lambda)} \, E_r(T^i)^{\lambda} \tag{2}
\]

$E_r(R^i)$ computes the recall of generated sentences against target sentences. $E_r(T^i)$ is computed as follows:

\[
E_r(T^i) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\bar{r}_k|} \, LCS(\bar{r}_k, G^i) \tag{3}
\]

$K$ denotes the number of records in table $T^i$, and $|\bar{r}_k|$ is the number of tokens in the value string of the $k$-th record.
$LCS(x, y)$ denotes the length of the longest common subsequence between $x$ and $y$. The hyperparameter $\lambda$ can be obtained heuristically using Eq. 3. The key idea is that if the recall of the target against the table is high, the target already covers most of the information in the structured inputs, so we can assign it a large weight $(1-\lambda)$. We use this method.
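To make the table-side recall concrete, the following is a minimal Python sketch, not the official PARENT implementation. It assumes that record values and generated text are already tokenized into lists of strings; the function names `lcs_length`, `table_recall`, and `combined_recall` are hypothetical and chosen only for illustration. The table recall is the mean, over records, of the LCS length between a record value and the generated text, divided by the length of the value; the combined recall is the weighted geometric average of the target-side and table-side recalls.

```python
from typing import List, Sequence


def lcs_length(x: Sequence[str], y: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def table_recall(record_values: List[List[str]], generated: List[str]) -> float:
    """Mean over records of LCS(value, generated) / len(value)."""
    if not record_values:
        return 0.0
    total = 0.0
    for value in record_values:
        if value:  # empty values contribute zero but still count as records
            total += lcs_length(value, generated) / len(value)
    return total / len(record_values)


def combined_recall(recall_target: float, recall_table: float, lam: float) -> float:
    """Weighted geometric average: recall_target^(1 - lam) * recall_table^lam."""
    return (recall_target ** (1.0 - lam)) * (recall_table ** lam)


if __name__ == "__main__":
    records = [["sunda", "kingdom"], ["723"]]
    text = "in 723 sanjaya ruled the sunda kingdom".split()
    print(table_recall(records, text))
```

A generated text that mentions every record value in order receives a table recall of 1, which is why longer outputs that copy more table content tend to score higher on this term.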
3 Progressive Edit of Text
We add the first part of the asymmetric output to an existing target and train a new model on the resulting dataset. This process is iterated: after the output of the previous model is added to an existing target, a new model is trained on this dataset. As a result, the output sentence goes through progressive editing (Figure 2). ProEdit is repeated as long as the PARENT metric keeps increasing.
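The target construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`repeated_target`, `first_part`, `proedit_target`, `proedit_round`) are hypothetical, the separator is assumed to be the literal string `<SEP>`, and model training and inference are left out entirely.

```python
from typing import List

SEP = "<SEP>"  # assumed literal form of the special separator token


def repeated_target(target: str) -> str:
    """Initial lengthened target: the same target repeated around <SEP>."""
    return f"{target} {SEP} {target}"


def first_part(output: str) -> str:
    """Text before the first <SEP>; the whole output if <SEP> is absent."""
    return output.split(SEP)[0].strip()


def proedit_target(previous_first_part: str, target: str) -> str:
    """Lengthened target for the next round: 'first part <SEP> original target'."""
    return f"{previous_first_part} {SEP} {target}"


def proedit_round(model_outputs: List[str], targets: List[str]) -> List[str]:
    """Build the next training targets from the previous model's outputs."""
    return [proedit_target(first_part(o), t) for o, t in zip(model_outputs, targets)]


if __name__ == "__main__":
    targets = ["The population of Herculaneum was 3,468 at the 2010 census."]
    outputs = ["As of 2010, 3,468 people lived in Herculaneum. <SEP> Population 3,468."]
    print(repeated_target(targets[0]))
    print(proedit_round(outputs, targets))
```

Each call to `proedit_round` corresponds to one ProEdit iteration: a new model would then be trained on the returned targets, and its outputs fed into the next round.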
4.1 Dataset and Implementation details
We used a pre-trained T5-large model with 737.66M parameters. To test Asymmetric Generation, we used two D2T datasets: the entire ToTTo dataset and a part of WIKITABLET. The ToTTo dataset is composed of Wikipedia tables paired with human-written descriptions. WIKITABLET, which pairs Wikipedia table data with the corresponding Wikipedia sections, is similar to ToTTo but has longer target texts. These datasets were selected because their lengthened targets do not exceed the maximum length of the T5 decoder.
On ToTTo, the training set consisted of 120k examples and the validation set of 7.7k examples. For WIKITABLET, we used 100k and 2.7k samples for training and validation, respectively. Because the maximum encoder input length was 512, structured inputs exceeding this length were truncated to 512. We set the batch size to 2 and the learning rate to 5e-5. In the generation phase, we used beam search with a beam size of 5, early stopping, and a no-repeat n-gram size of 7. Experiments were conducted on two A100 GPUs.
When the target consisted of two repeated sentences divided by a <SEP> token, the two generated parts were not the same (Table 1); this is the Asymmetric Generation phenomenon. Although the target sentences were simply repeated, the resulting parts differed in length and in metric scores. On both datasets, the first generated part was longer than the second, and its metric scores were higher than those of the latter. In addition, the PARENT scores of the first part were even better than those obtained with no-repeated-targets on both datasets.
In the first ProEdit iteration, the first part of the output of the model trained with repeated targets is placed in the first position of the new target. ProEdit is repeated as long as the overall F1 score keeps increasing. The results are summarized in Table 1. For ToTTo, the overall F1 score was highest with ProEdit-1-First. For WIKITABLET, the F1 score of ProEdit-1-First was higher than that of the no-repeated-target model but lower than that of Asymmetric Generation-First. On both datasets, the recall of the first part increased steadily as ProEdit was repeated.
Our method was evaluated on the test set of ToTTo (Table 2). Using ProEdit-1-First, we achieved a state-of-the-art PARENT score on the test set.
|Model||BLEU||PARENT|
|Pointer-Generator (See et al., 2017)||41.6||51.6|
|T5-based (Kale and Rastogi, 2020)||49.5||58.4|
|PlanGen (Su et al., 2021)||49.2||58.7|
Recall and Length. In general, a longer sentence increases the recall value. The first parts produced by Asymmetric Generation and ProEdit are longer than the reference. For a fair comparison, we generated outputs of similar length using beam search and minimum length settings (Table 3). For beam search, the longest sentence in the beam is selected. However, selecting the longest sentence in the beam actually reduced the recall value. With a minimum length setting, the probability of the EOS token is set to zero until the output reaches a certain length. Setting the minimum length increases both length and recall, but dramatically reduces precision, which leads to a steady decrease in the overall F1 score. In contrast, our proposed method improves the overall F1 score because its precision decreases only slightly.
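The minimum length setting described above can be illustrated with a small, library-free Python sketch. It is a simplification under stated assumptions: it uses greedy decoding over a fixed list of per-step scores instead of the beam search used in the experiments, and the names `mask_eos_until_min_length` and `greedy_decode` are hypothetical. The key step is setting the EOS score to negative infinity (probability zero) while the output is shorter than the minimum length.

```python
import math
from typing import List


def mask_eos_until_min_length(
    logits: List[float], eos_id: int, cur_len: int, min_length: int
) -> List[float]:
    """Return a copy of the scores with EOS disabled while cur_len < min_length."""
    masked = list(logits)
    if cur_len < min_length:
        masked[eos_id] = -math.inf  # EOS gets probability zero after softmax
    return masked


def greedy_decode(
    step_logits: List[List[float]], eos_id: int, min_length: int
) -> List[int]:
    """Greedy decoding over precomputed per-step scores (toy stand-in for a model)."""
    output: List[int] = []
    for logits in step_logits:
        masked = mask_eos_until_min_length(logits, eos_id, len(output), min_length)
        token = max(range(len(masked)), key=masked.__getitem__)
        if token == eos_id:
            break
        output.append(token)
    return output


if __name__ == "__main__":
    # Toy vocabulary of size 3 where token 0 is EOS and is preferred early on.
    scores = [[5.0, 1.0, 0.0], [5.0, 0.0, 1.0], [5.0, 2.0, 1.0], [0.0, 3.0, 1.0]]
    for min_len in (0, 2, 3):
        print(min_len, greedy_decode(scores, eos_id=0, min_length=min_len))
```

In the toy example the model prefers to stop immediately, so a larger minimum length forces it to emit tokens it scored lower, which mirrors the drop in precision reported for the minimum length settings.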
Asymmetric Generation. We conducted the repeated-target experiment on ToTTo with another model, GPT-2 (Radford et al., 2019). Asymmetric Generation also occurs there (Table 4), so the phenomenon is not specific to the T5 model.
|Setting||BLEU||P||R||F1||Length|
|Beam size 5||45.8||77.78||49.57||56.86||88.24|
|Beam size 11||44.3||77.02||49.24||56.48||90.65|
|min length 20||45.9||77.78||51.32||58.17||89.27|
|min length 25||40||74.48||52.13||57.61||101.98|
|min length 30||34.6||71.22||52.68||56.74||116.79|
In this paper, we proposed the Progressive Edit (ProEdit) process for D2T generation. It exploits Asymmetric Generation to improve recall. We obtained a new state-of-the-art result on ToTTo.
References
- Chen et al. (2020). WikiTableT: a large-scale data-to-text dataset for generating Wikipedia article sections. arXiv preprint arXiv:2012.14919.
- Dhingra et al. (2019). Handling divergent reference texts when evaluating table-to-text generation. arXiv preprint arXiv:1906.01081.
- Gu et al. (2016). Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
- Kale and Rastogi (2020). Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.
- Parikh et al. (2020). ToTTo: a controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
- Puduppully et al. (2019). Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6908–6915.
- Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
- Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Reiter and Dale (1997). Building applied natural language generation systems. Natural Language Engineering 3 (1), pp. 57–87.
- See et al. (2017). Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- Su et al. (2021). Plan-then-generate: controlled data-to-text generation via planning. arXiv preprint arXiv:2108.13740.
- Tan et al. (2020). Progressive generation of long text with pretrained language models. arXiv preprint arXiv:2006.15720.
- Wang et al. (2021). Sketch and refine: towards faithful and informative table-to-text generation. arXiv preprint arXiv:2105.14778.
Appendix A Examples of generated results
|In 723, Jamri was the King of Sunda.|
|No-Repeated-Target||Sanjaya (r. 723–732) was the ruler of the Sunda Kingdom.|
|Asymmetric Generation-First||In 723, Sanjaya, Harisdarma and Rakeyan Jamri were the rulers of the Sunda Kingdom.|
|As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.|
|No-Repeated-Target||The population of Herculaneum was 3,468 at the 2010 census.|
|Asymmetric Generation-First||As of the census of 2010, there were 3,468 people residing in the Herculaneum.|
|ProEdit-1-First||As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.|
Appendix B Rules for post-processing
When lengthened targets divided by a <SEP> token were given, the model sometimes produced no <SEP> token in its output or generated several. We determined the first part and the second part according to the following rules.
If <SEP> does not occur:
the first part = generated sentence
the second part = generated sentence
If <SEP> occurs once:
the first part = generated sentence_1
the second part = generated sentence_2
If <SEP> occurs several times:
the first part = generated sentence_1
the second part = generated sentence_1
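The rules above can be sketched as a small Python function. This is an illustrative sketch rather than the authors' script: the name `split_parts` is hypothetical, the separator is assumed to appear in the output as the literal string `<SEP>`, and the case with several separators follows the rule exactly as listed (both parts take the text before the first separator).

```python
from typing import Tuple

SEP = "<SEP>"  # assumed literal form of the separator in the model output


def split_parts(generated: str, sep: str = SEP) -> Tuple[str, str]:
    """Return (first part, second part) of a generated output."""
    pieces = [p.strip() for p in generated.split(sep)]
    n_sep = len(pieces) - 1
    if n_sep == 0:
        # No separator: the whole output serves as both parts.
        return pieces[0], pieces[0]
    if n_sep == 1:
        # Exactly one separator: the text before and after it.
        return pieces[0], pieces[1]
    # Several separators: as listed above, both parts use the first segment.
    return pieces[0], pieces[0]


if __name__ == "__main__":
    print(split_parts("a b"))
    print(split_parts("a <SEP> b"))
    print(split_parts("a <SEP> b <SEP> c"))
```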