AGGGEN: Ordering and Aggregating while Generating

We present AGGGEN (pronounced 'again'), a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems: input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end: AGGGEN performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) between input representation and target text. Experiments on the WebNLG and E2E challenge data show that by using fact-based alignments our approach is more interpretable, expressive, robust to noise, and easier to control, while retaining the advantages of end-to-end systems in terms of fluency. Our code is available at https://github.com/XinnuoXu/AggGen.

1 Introduction

Recent neural data-to-text systems generate text “end-to-end” (E2E) by learning an implicit mapping between input representations (e.g. RDF triples) and target texts. While this can lead to increased fluency, E2E methods often produce repetitions, hallucinations, and/or omissions of important content, both in data-to-text generation Dušek et al. (2020) and in other natural language generation (NLG) tasks Cao et al. (2018); Rohrbach et al. (2018). Traditional NLG systems, on the other hand, tightly control which content gets generated, as well as its ordering and aggregation. This process is called sentence planning Reiter and Dale (2000); Duboue and McKeown (2001, 2002); Konstas and Lapata (2013); Gatt and Krahmer (2018). Figure 1 shows two different ways to arrange and combine the representations in the input, resulting in widely different generated target texts.

In this work, we combine the advantages of both paradigms in a single system by reintroducing sentence planning into neural architectures. We call our system AggGen (pronounced ‘again’). AggGen learns to plan and generate at the same time. Crucially, our sentence plans are interpretable latent states based on semantic facts obtained via Semantic Role Labelling (SRL), where each fact roughly captures “who did what to whom”; these facts align the target text with parts of the input representation. In contrast, the plans used in other neural plan-based approaches are usually limited in terms of interpretability, control, and expressivity. For example, in Moryossef et al. (2019); Zhao et al. (2020) the sentence plan is created independently, incurring error propagation; Wiseman et al. (2018) use latent segmentation that limits interpretability; Shao et al. (2019) sample from a latent variable, not allowing for explicit control; and Shen et al. (2020) aggregate multiple input representations, which limits expressiveness.

Figure 1: Two different sentence plans with their corresponding generated target texts from our model on the WebNLG dataset. Planning and generation are performed jointly. The dashed line denotes aggregation.

AggGen explicitly models the two planning processes (ordering and aggregation), which allows us to directly influence the resulting plan and generated target text via a separate inference algorithm based on dynamic programming. Crucially, this also enables us to directly evaluate and inspect the model’s planning and alignment performance by comparing to manually aligned reference texts.

We demonstrate this for two data-to-text generation tasks: the E2E NLG Novikova et al. (2017) and the WebNLG Challenge Gardent et al. (2017). We work with a triple-based semantic representation, where a triple consists of a subject, a predicate and an object (note that E2E NLG data and other input semantic representations can be converted into triples, see Section 4.1). For instance, in the last triple in Figure 1, Apollo 8, operator and NASA are the subject, predicate and object, respectively. Our contributions are as follows:

  • We present a novel interpretable architecture for jointly learning to plan and generate, based on modelling ordering and aggregation by aligning facts in the target text to input representations with an HMM and a Transformer encoder-decoder.

  • We show that our method generates output with higher factual correctness than vanilla encoder-decoder models without semantic information.

  • We also introduce an intrinsic evaluation framework for inspecting sentence planning, with a rigorous human evaluation procedure to assess factual correctness in terms of alignment, aggregation and ordering performance.

2 Related Work

Factual correctness is one of the main issues for data-to-text generation: How to generate text according to the facts specified in the input triples without adding, deleting or replacing information?

The prevailing sequence-to-sequence (seq2seq) architectures typically address this issue via reranking Wen et al. (2015a); Dušek and Jurčíček (2016a); Juraska et al. (2018) or some sophisticated training techniques Nie et al. (2019); Kedzie and McKeown (2019); Qader et al. (2019). For applications where structured inputs are present, neural graph encoders Marcheggiani and Perez-Beltrachini (2018); Rao et al. (2019); Gao et al. (2020) or decoding of explicit graph references Logan et al. (2019) are applied for higher accuracy. Recently, large-scale pretraining has achieved SoTA results on WebNLG by fine-tuning T5 Kale and Rastogi (2020).

Several works aim to improve accuracy and controllability by dividing the end-to-end architecture into sentence planning and surface realisation. Castro Ferreira et al. (2019) feature a pipeline with multiple planning stages, and Elder et al. (2019) introduce a symbolic intermediate representation in multi-stage neural generation. Moryossef et al. (2019) use pattern matching to approximate the required planning annotation (entity mentions, their order and sentence splits).

Zhao et al. (2020) use a planning stage in a graph-based model – the graph is first reordered into a plan; the decoder conditions on both the input graph encoder and the linearized plan. Similarly, Fan et al. (2019) use a pipeline approach for story generation via SRL-based sketches. However, all of these pipeline-based approaches either require additional manual annotation or depend on a parser for the intermediate steps.

Other works, in contrast, learn planning and realisation jointly. For example, Su et al. (2018) introduce a hierarchical decoding model generating different parts of speech at different levels, while filling in slots between previously generated tokens. Puduppully et al. (2019) include a jointly trained content selection and ordering module that is applied before the main text generation step; the model is trained by maximizing the log-likelihood of the gold content plan and the gold output text. Li and Rush (2020) utilize posterior regularization in a structured variational framework to induce which input items are being described by each token of the generated text. Wiseman et al. (2018) aim for better semantic control by using a Hidden Semi-Markov Model (HSMM) for splitting target sentences into short phrases corresponding to “templates”, which are then concatenated to produce the outputs; however, this trades fluency for controllability. Similarly, Shen et al. (2020) explicitly segment target text into fragment units, while aligning them with their corresponding input. Shao et al. (2019) use a Hierarchical Variational Model to aggregate input items into a sequence of local latent variables and realize sentences conditioned on the aggregations. The aggregation strategy is controlled by sampling from a global latent variable.

In contrast to these previous works, we achieve input ordering and aggregation, input-output alignment and text generation control via interpretable states, while preserving fluency.

3 Joint Planning and Generation

We jointly learn to generate and plan by aligning facts in the target text with parts of the input representation. We model this alignment using a Hidden Markov Model (HMM) that follows a hierarchical structure comprising two sets of latent states, corresponding to ordering and aggregation. The model is trained end-to-end and all intermediate steps are learned in a unified framework.

3.1 Model Overview

Let $(\mathcal{X}, Y)$ be a pair consisting of a collection of input triples $\mathcal{X}$ and their natural language description $Y$ (human-written target text). We first segment $Y$ into a sequence of facts $\mathbf{f} = (f_1, \dots, f_T)$, where each fact roughly captures “who did what to whom” in one event. We follow the approach of Xu et al. (2020), where facts correspond to predicates and their arguments as identified by SRL (see Appendix B for more details).

Each fact $f_t$ consists of a sequence of tokens $(w_{t,1}, \dots, w_{t,|f_t|})$. Unlike the text itself, the planning information, i.e. input aggregation and ordering, is not directly observable due to the absence of labelled datasets. AggGen therefore utilises an HMM probabilistic model, which assumes that there is an underlying hidden process that can be modeled by a first-order Markov chain. At each time step, a latent variable (in our case, a group of input triples) is responsible for emitting an observed variable (in our case, a fact text segment). The HMM specifies a joint distribution over the observations and the latent variables. Here, a latent state $s_t$ emits a fact $f_t$ and represents the group of input triples that is verbalized in $f_t$. We write the joint likelihood as:

$$p(\mathbf{f}, \mathbf{s} \mid \mathcal{X}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}) \; p(f_t \mid s_t, \mathcal{X})$$

i.e., it is a product of the probabilities of each latent state transition (transition distribution) and the probability of the observations given their respective latent state (emission distribution).
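To make this factorization concrete, the following minimal Python sketch (with hypothetical scoring helpers `transition_logp` and `emission_logp`, and illustrative variable names) accumulates the joint log-likelihood of a fact sequence and a candidate plan; in the actual model both terms are parameterized with the neural components described in Section 3.2.

```python
def joint_log_likelihood(plan, facts, transition_logp, emission_logp, start_state="<s>"):
    """log p(facts, plan | X) = sum_t [ log p(s_t | s_{t-1}) + log p(f_t | s_t, X) ].

    plan:  list of latent states, one per fact (each state = tuple of predicates)
    facts: list of fact strings (segments of the target text)
    transition_logp(prev_state, state): hypothetical helper returning log p(s_t | s_{t-1})
    emission_logp(state, fact):         hypothetical helper returning log p(f_t | s_t, X)
    """
    assert len(plan) == len(facts), "one latent state per fact"
    logp, prev = 0.0, start_state
    for state, fact in zip(plan, facts):
        logp += transition_logp(prev, state)  # latent state transition
        logp += emission_logp(state, fact)    # emission of the observed fact
        prev = state
    return logp
```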

3.2 Parameterization

Latent State.

A latent state $s_t$ represents the input triples that are verbalized in the observed fact $f_t$. It is not guaranteed that one fact always verbalizes only one triple (see the bottom example in Figure 1). Thus, we represent state $s_t$ as a sequence of latent variables $s_t = (z_{t,1}, \dots, z_{t,m_t})$, where $m_t$ is the number of triples verbalized in $f_t$. Figure 2 shows the structure of the model.

Figure 2: The structure of our model. $s_{t-1}$, $s_t$, $f_{t-1}$, and $f_t$ represent the basic HMM structure, where $s_{t-1}$, $s_t$ are latent states and $f_{t-1}$, $f_t$ are observations. Inside the dashed frames is the corresponding structure for each latent state $s_t$, which is a sequence of latent variables representing the predicates that emit the observation. For example, at time step $t$, two input triples (‘member_of’ and ‘operator’) are verbalized in the observed fact $f_t$, whose predicates are represented as latent variables $z_{t,1}$ and $z_{t,2}$. T1–4 represent transitions introduced in Section 3.2.

Let $\mathcal{Z}$ be the set of possible latent variables; then $|\mathcal{Z}|^{m_t}$ is the size of the search space for $s_t$. If $\mathcal{Z}$ maps to unique triples, the search space becomes intractable for a large value of $|\mathcal{Z}|$. To make the problem tractable, we decrease $|\mathcal{Z}|$ by representing triples by their predicate; $\mathcal{Z}$ thus stands for the collection of all predicates appearing in the corpus. To reduce the search space for $s_t$ further, we limit $m_t \leq K$ with $K = 3$ (by aligning the triples to facts using a rule-based aligner, see Section 5, we found that the chance of aggregating more than three triples into one fact is under 0.01% in the training set of both the WebNLG and E2E datasets).

Transition Distribution.

The transition distribution between latent variables (T1 in Figure 2) is a $|\mathcal{Z}| \times |\mathcal{Z}|$ matrix of probabilities $A$, where each row sums to 1. We define this matrix as

$$A = \operatorname{softmax}\big( (E \, E'^{\top}) \odot M \big) \qquad (1)$$

where $\odot$ denotes the Hadamard product. $E$ and $E'$ are matrices of predicate embeddings with dimension $|\mathcal{Z}| \times d$. $\mathcal{P}_{\mathcal{X}}$ is the set of predicates of the input triples $\mathcal{X}$, and each $p_i \in \mathcal{P}_{\mathcal{X}}$ is the predicate of the triple $x_i$. $M$ is a masking matrix whose entry $M_{ij}$ keeps the score if $z_i \in \mathcal{P}_{\mathcal{X}}$ and $z_j \in \mathcal{P}_{\mathcal{X}}$, and masks it out otherwise. We apply row-wise softmax over the resulting matrix to obtain probabilities.
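A minimal numpy sketch of our reading of Equation 1 (illustrative names): bilinear scores between two predicate embedding matrices are masked so that only predicates present in the current input can transition to each other, followed by a row-wise softmax. For numerical convenience the sketch uses additive -inf masking before the softmax, which has the same effect as removing the masked transitions.

```python
import numpy as np

def transition_matrix(E, E_prime, input_predicate_ids):
    """Sketch of the masked transition distribution over predicates (cf. Equation 1).

    E, E_prime:          [Z, d] predicate embedding matrices
    input_predicate_ids: indices of predicates appearing in the input triples
    Returns a [Z, Z] matrix; rows restricted to the input predicates sum to 1.
    """
    scores = E @ E_prime.T                            # [Z, Z] bilinear scores
    keep = np.zeros_like(scores, dtype=bool)
    idx = np.asarray(input_predicate_ids)
    keep[np.ix_(idx, idx)] = True                     # keep (i, j) only if both predicates are in the input
    masked = np.where(keep, scores, -np.inf)          # mask out all other transitions
    row_max = masked.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)   # fully masked rows stay at zero probability
    unnorm = np.exp(masked - row_max)
    return unnorm / np.maximum(unnorm.sum(axis=1, keepdims=True), 1e-12)

# Toy example: 4 predicates in the vocabulary, 3 of them in the current input.
rng = np.random.default_rng(0)
A = transition_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), [0, 2, 3])
```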

The probability of generating the latent state $s_t$ (T2 in Figure 2) can be written as the joint distribution of its latent variables $(z_{t,1}, \dots, z_{t,m_t})$. Assuming a first-order Markov chain, we get:

$$p(s_t) = \prod_{j=1}^{m_t} p(z_{t,j} \mid z_{t,j-1})$$

where $z_{t,0}$ is a marked start-state.

On top of the generation probabilities of the latent states $s_{t-1}$ and $s_t$, we define the transition distribution between two latent states (T3 in Figure 2) as:

$$p(s_t \mid s_{t-1}) = p\big(z_{t,1} \mid z_{t-1,\,m_{t-1}}\big) \; \prod_{j=2}^{m_t} p(z_{t,j} \mid z_{t,j-1})$$

where $z_{t-1,\,m_{t-1}}$ denotes the last latent variable in latent state $s_{t-1}$, while $z_{t,1}$ denotes the first latent variable (other than the start-state) in latent state $s_t$. We use two sets of parameters, $\{E_{\mathrm{w}}, E'_{\mathrm{w}}\}$ and $\{E_{\mathrm{a}}, E'_{\mathrm{a}}\}$, to describe the transition distribution between latent variables within and across latent states, respectively.

Emission Distribution.

The emission distribution (T4 in Figure 2) describes the generation of fact $f_t$ conditioned on latent state $s_t$ and input triples $\mathcal{X}$. We define the probability of generating a fact as the product over token-level probabilities,

$$p(f_t \mid s_t, \mathcal{X}) = \prod_{i=1}^{|f_t|} p(w_{t,i} \mid w_{t,1:i-1}, s_t, \mathcal{X})$$

The first and last tokens of a fact are marked fact-start and fact-end tokens. We adopt the Transformer Vaswani et al. (2017) as the model’s encoder and decoder.

Each triple is linearized into a list of tokens following the order: subject, predicate, and object. In order to represent individual triples, we insert special [SEP] tokens at the end of each triple. A special [CLS] token is inserted before all input triples, representing the beginning of the entire input. An example where the encoder produces a contextual embedding for the tokens of two input triples is shown in Figure 6 in Appendix E.

At time step $t$, the decoder generates fact $f_t$ token-by-token autoregressively, conditioned on both the contextually encoded input and the latent state $s_t$. To guarantee that the generation of $f_t$ conditions only on the input triples whose predicates are in $s_t$, we mask out the contextual embeddings of tokens from other, unrelated triples in the encoder-decoder attention of all Transformer layers.
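The sketch below (plain Python, illustrative token handling) shows the [CLS]/[SEP] linearization and one way to build a binary encoder-decoder attention mask so that, while emitting a fact, the decoder only attends to the triples whose predicates are in the current latent state; keeping the [CLS] position visible is a choice of this sketch, not something specified above.

```python
def linearize_triples(triples):
    """Linearize triples as: [CLS] subj pred obj [SEP] subj pred obj [SEP] ...
    Returns the token list and, for each triple, its predicate and token span."""
    tokens, spans = ["[CLS]"], []
    for subj, pred, obj in triples:
        start = len(tokens)
        tokens += subj.split() + pred.split() + obj.split() + ["[SEP]"]
        spans.append((pred, start, len(tokens)))
    return tokens, spans

def encdec_attention_mask(spans, num_tokens, state_predicates):
    """1 = the decoder may attend to this encoder position while emitting the current fact."""
    mask = [0] * num_tokens
    mask[0] = 1                                  # keep [CLS] visible (sketch choice)
    for pred, start, end in spans:
        if pred in state_predicates:             # only triples verbalized by the current state
            for i in range(start, end):
                mask[i] = 1
    return mask

triples = [("Apollo 8", "operator", "NASA"),
           ("William Anders", "was_a_crew_member_of", "Apollo 8")]
tokens, spans = linearize_triples(triples)
mask = encdec_attention_mask(spans, len(tokens), state_predicates={"operator"})
```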

Autoregressive Decoding.

The autoregressive Hidden Markov Model (AR-HMM) introduces extra links into the HMM to capture long-term correlations between observed variables, i.e., output tokens. Following Wiseman et al. (2018), we use an AR-HMM for decoding, allowing interdependence between tokens to generate more fluent and natural text descriptions. Each token distribution depends on all previously generated tokens, i.e., we define the token-level probabilities as $p(w_{t,i} \mid f_{1:t-1}, w_{t,1:i-1}, s_t, \mathcal{X})$ instead of $p(w_{t,i} \mid w_{t,1:i-1}, s_t, \mathcal{X})$. During training, at each time step $t$, we teacher-force the generation of the fact $f_t$ by feeding the ground-truth history $f_{1:t-1}$ to the word-level Transformer decoder. However, since $f_t$ only depends on the current hidden state $s_t$, we only calculate the loss over $f_t$.

3.3 Learning

We apply the backward algorithm Rabiner (1989) to learn the parameters introduced in Section 3.2, where we maximize $p(\mathbf{f} \mid \mathcal{X})$, i.e., the marginal likelihood of the observed facts given input triples $\mathcal{X}$, over all the latent states on the entire dataset using dynamic programming. Following Murphy (2012), and given that the latent state at time $t$ is $s_t = g$, we define a conditional likelihood of future evidence as:

$$\beta_t(g) = p(f_{t+1:T} \mid s_t = g, \mathcal{X}) \qquad (2)$$

where $g$ denotes a group of predicates that are associated with the emission of $f_t$. The size of $g$ ranges from 1 to $K$, and each component is from the collection of predicates $\mathcal{Z}$ (see Section 3.2). Then, the backward recurrences are:

$$\beta_{t-1}(g') = \sum_{g} p(s_t = g \mid s_{t-1} = g') \; p(f_t \mid s_t = g, \mathcal{X}) \; \beta_t(g)$$

with the base case $\beta_T(g) = 1$. The marginal probability of $\mathbf{f}$ over the latent states is then obtained as $p(\mathbf{f} \mid \mathcal{X}) = \sum_{g} p(s_1 = g \mid s_0) \, p(f_1 \mid s_1 = g, \mathcal{X}) \, \beta_1(g)$, where $s_0$ is the marked start-state.
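The recursion can be sketched as follows (plain Python, hypothetical log-probability helpers; the group enumeration mirrors the restriction to at most K input predicates per state). The real implementation batches these computations with the neural transition and emission scores.

```python
import math
from itertools import combinations

def log_sum_exp(xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def backward_marginal(facts, input_predicates, trans_logp, emis_logp, max_group=3):
    """log p(f_1..f_T | X) via the backward algorithm over predicate groups.

    trans_logp(prev_group, group): log transition; prev_group=None is the start-state.
    emis_logp(group, fact):        log emission probability of a fact given a group.
    """
    states = [g for k in range(1, max_group + 1)
              for g in combinations(sorted(input_predicates), k)]
    T = len(facts)
    beta = {g: 0.0 for g in states}                       # base case: beta_T(g) = log 1
    for t in range(T - 1, 0, -1):                         # fill beta_t for t = T-1 .. 1
        beta = {gp: log_sum_exp([trans_logp(gp, g) + emis_logp(g, facts[t]) + beta[g]
                                 for g in states])
                for gp in states}
    return log_sum_exp([trans_logp(None, g) + emis_logp(g, facts[0]) + beta[g]
                        for g in states])
```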

In Equation 2, the size of the search space for $g$ is $O(|\mathcal{Z}|^K)$, where $|\mathcal{Z}|$ is the number of unique predicates appearing in the dataset. The problem can still be intractable due to a high $|\mathcal{Z}|$, despite the simplifications explained in Section 3.2 (representing triples by their predicates). To tackle this issue and reduce the search space of $g$, we: (1) only explore permutations of predicates appearing in the input; (2) introduce a heuristic based on the overlap of tokens between a triple and a fact: if a certain fact mentions most tokens appearing in the predicate and object of a triple, we hard-align it to this triple. (This heuristic uses the rule-based aligner introduced in Section 5 with a threshold that rules out alignments in which less than 50% of the triple is covered, since our model emphasises precision; thus, not all triples are aligned to a fact.) As a result, we discard the permutations that do not include the aligned predicates.
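A sketch of the token-overlap heuristic (tokenization and threshold simplified): each triple is hard-aligned to the fact that covers the largest share of its predicate and object tokens, provided that more than half of those tokens are covered.

```python
def hard_align(triples, facts, min_coverage=0.5):
    """Map each triple index to the index of its best-matching fact, or None."""
    alignments = {}
    for i, (subj, pred, obj) in enumerate(triples):
        triple_tokens = set((pred + " " + obj).lower().split())
        best_fact, best_cov = None, 0.0
        for j, fact in enumerate(facts):
            fact_tokens = set(fact.lower().split())
            coverage = len(triple_tokens & fact_tokens) / max(len(triple_tokens), 1)
            if coverage > best_cov:
                best_fact, best_cov = j, coverage
        alignments[i] = best_fact if best_cov > min_coverage else None  # precision over recall
    return alignments
```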

3.4 Inference

After the joint learning process, the model is able to plan, i.e., order and aggregate the input triples in the most likely way, and then generate a text description following the planning results. Therefore, the joint prediction of the plan and the text is defined as:

$$(\hat{\mathbf{s}}, \hat{\mathbf{f}}) = \operatorname*{arg\,max}_{\mathbf{s} \in \mathcal{S},\, \mathbf{f}} \; p(\mathbf{s} \mid \mathcal{X}) \; p(\mathbf{f} \mid \mathbf{s}, \mathcal{X}) \qquad (3)$$

where $\mathcal{S}$ denotes the set of planning results, $\hat{\mathbf{f}}$ is the text description, and $\hat{\mathbf{s}}$ is the planning result that $\hat{\mathbf{f}}$ is generated from.

The entire inference process (see Figure 3) includes three steps: input ordering, input aggregation, and text generation. The first two steps are responsible for generating the candidate plans $\mathbf{s}$ together with their probabilities $p(\mathbf{s} \mid \mathcal{X})$, while the last step performs the text generation $p(\mathbf{f} \mid \mathbf{s}, \mathcal{X})$.

Figure 3: The inference process (Section 3.4)

Planning: Input Ordering. The aim is to find the top-k most likely orderings of the predicates appearing in the input triples. In order to make the search process more efficient, we apply left-to-right beam search (we use beam search since Viterbi decoding requires the observed facts, which are not available at this stage) based on the transition distribution introduced in Equation 1. Specifically, we use the transition distribution between latent variables within latent states, calculated with the predicate embeddings $E_{\mathrm{w}}$ and $E'_{\mathrm{w}}$ (see Section 3.2). To guarantee that the generated sequence does not suffer from omission or duplication of predicates, we constantly update the masking matrix by removing generated predicates from the set $\mathcal{P}_{\mathcal{X}}$. The planning process stops when $\mathcal{P}_{\mathcal{X}}$ is empty.
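A simplified sketch of this step (hypothetical `trans_logp` helper, illustrative names): left-to-right beam search over the input predicates, where each extension is scored with the within-state transition distribution and already-generated predicates are excluded, so the final orderings contain every input predicate exactly once.

```python
def order_predicates(input_predicates, trans_logp, beam_size=3):
    """Return the top-k orderings of the input predicates with their scores."""
    beams = [((), 0.0)]                                   # (partial ordering, log-probability)
    for _ in range(len(input_predicates)):
        expanded = []
        for seq, score in beams:
            prev = seq[-1] if seq else None               # None stands for the start-state
            for pred in input_predicates:
                if pred in seq:                           # no duplication of predicates
                    continue
                expanded.append((seq + (pred,), score + trans_logp(prev, pred)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams
```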

Planning: Input Aggregation. The goal is to find the top-k most likely aggregations for each result of the Input Ordering step. To implement this process efficiently, we introduce a binary state for each predicate in the sequence: 0 indicates “wait” and 1 indicates “emit” (green squares in Figure 3). Then we list all possible combinations of the binary states for the Input Ordering result (since we assume that each fact is comprised of at most $K$ triples, we discard combinations containing a group that aggregates more than $K$ predicates). For each combination, the aggregation algorithm proceeds left-to-right over the predicates and groups those labelled as “emit” with all immediately preceding predicates labelled as “wait”. In turn, we rank all the combinations with the transition distribution introduced in Equation 1. In contrast to the Input Ordering step, we use the transition distribution between latent variables across latent states, calculated with the predicate embeddings $E_{\mathrm{a}}$ and $E'_{\mathrm{a}}$. That is, we do not take into account transitions between two consecutive predicates if they belong to the same group. Instead, we only consider consecutive predicates across two connected groups, i.e., the last predicate of the previous group with the first predicate of the following group.
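The enumeration of wait/emit combinations can be sketched as follows (plain Python); the ranking with the across-state transition distribution is omitted here.

```python
from itertools import product

def enumerate_aggregations(ordering, max_group=3):
    """All ways of splitting an ordered predicate sequence into consecutive groups.

    Each predicate is labelled 0 ('wait') or 1 ('emit'); an 'emit' closes a group
    consisting of itself and the immediately preceding 'wait' predicates. The last
    predicate must emit, otherwise some predicates would remain ungrouped.
    """
    plans = []
    for labels in product([0, 1], repeat=len(ordering)):
        if labels[-1] != 1:
            continue
        groups, current = [], []
        for pred, lab in zip(ordering, labels):
            current.append(pred)
            if lab == 1:                                  # 'emit' closes the current group
                groups.append(tuple(current))
                current = []
        if all(len(g) <= max_group for g in groups):      # discard groups larger than K
            plans.append(groups)
    return plans

# enumerate_aggregations(("eatType", "near", "customerRating")) yields, among others,
# [('eatType',), ('near', 'customerRating')] and [('eatType', 'near', 'customerRating')].
```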

Text Generation. The final step generates a text description conditioned on the input triples and the planning result (obtained from the Input Aggregation step). We use beam search and the planning-conditioned generation process described in Section 3.2 (“Emission Distribution”).

3.5 Controllability over sentence plans

While the jointly learnt model is capable of fully automatic generation including the planning step (see Section 3.4), the discrete latent space allows direct access to manually control the planning component, which is useful in settings that require increased human supervision and is a unique feature of our architecture. The plans (latent variables) can be controlled in two ways: (1) via a hyperparameter: our code offers a hyperparameter that can be tuned to control the level of aggregation (e.g., no aggregation, aggregating two triples per fact, etc.); the model then predicts the most likely plan based on the input triples and the hyperparameter and generates a corresponding text description; (2) the model can directly adopt human-written plans, e.g. using the notation [eatType][near customer-rating], which translates to: first generate ‘eatType’ as an independent fact, then aggregate the predicates ‘near’ and ‘customer-rating’ in the following fact and generate their joint description.
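For illustration, a human-written plan string in the notation above could be parsed into predicate groups with a few lines of Python before being passed to the planning-conditioned decoder (parsing only; the generation call itself is not shown):

```python
import re

def parse_plan(plan_string):
    """'[eatType][near customer-rating]' -> [('eatType',), ('near', 'customer-rating')]"""
    groups = re.findall(r"\[([^\]]+)\]", plan_string)
    return [tuple(group.split()) for group in groups]

assert parse_plan("[eatType][near customer-rating]") == [("eatType",), ("near", "customer-rating")]
```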

4 Experiments

4.1 Datasets

We tested our approach on two widely used data-to-text tasks: the E2E NLG Novikova et al. (2017) and WebNLG Gardent et al. (2017) challenges. Since we propose exploring sentence planning and increasing the controllability of the generation model and do not aim for a zero-shot setup, we only focus on the seen category in WebNLG. Compared to E2E, WebNLG is smaller, but contains more predicates and has a larger vocabulary. Statistics with examples can be found in Appendix C. We followed the original training-development-test data split for both datasets.

4.2 Evaluation Metrics

Generation Evaluation focuses on evaluating the generated text with respect to its similarity to human-authored reference sentences. To compare to previous work, we adopt their associated metrics to evaluate each task. The E2E task is evaluated using BLEU Papineni et al. (2002), NIST Doddington (2002), ROUGE-L Lin (2004), METEOR Lavie and Agarwal (2007), and CIDEr Vedantam et al. (2015). WebNLG is evaluated in terms of BLEU, METEOR, and TER Snover et al. (2006).

Factual Correctness Evaluation tests if the generated text corresponds to the input triples Wen et al. (2015b); Reed et al. (2018); Dušek et al. (2020). We evaluated on the E2E test set using the automatic slot error rate (SER), i.e., an estimate of the occurrence of the input attributes (predicates) and their values in the outputs, as implemented by Dušek et al. (2020). SER counts predicates that were added, missed, or realised with a wrong object. (SER is based on regular expression matching; since only the format of the E2E data allows such patterns for evaluation, we only evaluate factual correctness on the E2E task.)
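As a rough illustration of the counting logic only (the actual implementation by Dušek et al. (2020) extracts slot realisations from the text with regular expressions), SER can be thought of as follows, assuming slot-value pairs have already been extracted from the generated text:

```python
def slot_errors(input_slots, realized_slots):
    """Count added / missed / wrong slots for one instance (dicts: slot -> value)."""
    added = sum(1 for s in realized_slots if s not in input_slots)
    missed = sum(1 for s in input_slots if s not in realized_slots)
    wrong = sum(1 for s, v in input_slots.items()
                if s in realized_slots and realized_slots[s] != v)
    return added, missed, wrong

def slot_error_rate(instances):
    """SER over a test set: (added + missed + wrong) / total number of input slots."""
    added = missed = wrong = total = 0
    for input_slots, realized_slots in instances:
        a, m, w = slot_errors(input_slots, realized_slots)
        added, missed, wrong, total = added + a, missed + m, wrong + w, total + len(input_slots)
    return (added + missed + wrong) / max(total, 1)
```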

Intrinsic Planning Evaluation examines planning performance in Section 6.

4.3 Baseline model and Training Details

To evaluate the contributions of the planning component, we choose the vanilla Transformer model Vaswani et al. (2017) as our baseline, trained on pairs of linearized input triples and target texts. In addition, we choose two types of previous works for comparison: (1) best-performing models reported on the WebNLG 2017 (seen) and E2E dataset, i.e. T5 Kale and Rastogi (2020), PlanEnc Zhao et al. (2020), ADAPT Gardent et al. (2017), and TGen Dušek and Jurčíček (2016b); (2) models with explicit planning, i.e. TILB-PIPE Gardent et al. (2017), NTemp+AR Wiseman et al. (2018) and Shen et al. (2020).

To make our HMM-based approach converge faster, we initialized its encoder and decoder with the baseline model parameters and fine-tuned them during training of the transition distributions. Encoder and decoder parameters were chosen based on validation results of the baseline model for each task (see Appendix D for details).

5 Experiment Results

5.1 Generation Evaluation Results

Model BLEU TER METEOR
T5* 64.70 – 0.46
PlanEnc* 64.42 0.33 0.45
ADAPT* 60.59 0.37 0.44
TILB-PIPE* 44.34 0.48 0.38
Transformer 58.47 0.37 0.42
AggGen 58.74 0.40 0.43
AggGen (rand. order) 55.30 0.44 0.43
AggGen (no agg.) 52.17 0.50 0.44
Table 1: Generation Evaluation Results on WebNLG (seen). Models marked with * are from previous work; the rest are our implementations.
Model BLEU NIST MET R-L CIDEr Add Miss Wrong SER
TGen* 66.41 8.5565 45.07 69.17 2.2253 00.14 04.11 00.03 04.27
NTemp+AR* 59.80 7.5600 38.75 65.01 1.9500 – – – –
Shen et al. (2020)* 65.10 – 45.50 68.20 2.2410 – – – –
Transformer 68.23 8.6765 44.31 69.88 2.2153 00.30 04.67 00.20 05.16
AggGen 64.14 8.3509 45.13 66.62 2.1953 00.32 01.66 00.71 02.70
AggGen (rand. order) 58.90 7.9100 43.21 62.12 1.9656 01.65 02.99 03.01 07.65
AggGen (no agg.) 44.00 6.0890 43.75 58.24 0.8202 08.74 00.45 00.92 10.11
Table 2: Evaluation of Generation (middle) and Factual Correctness (right) trained/tested on the original E2E data (see Section 4.2 for metric descriptions). Models marked with * are from previous work; the rest are our implementations.

Table 1 shows the generation results on the WebNLG seen category Gardent et al. (2017). Our model outperforms TILB-PIPE and Transformer, but performs worse than T5, PlanEnc and ADAPT. However, unlike these three models, our approach does not rely on large-scale pretraining, extra annotation, or heavy pre-processing using external resources. Table 2 shows the results when training and testing on the original E2E set. AggGen outperforms NTemp+AR and is comparable with Shen et al. (2020), but performs slightly worse than both seq2seq models in terms of word-overlap metrics.

However, the results in Table 3 demonstrate that our model does outperform the baselines on most surface metrics if trained on the noisy original E2E training set and tested on clean E2E data Dušek et al. (2019). This suggests that the previous performance drop was caused by text references in the original dataset that did not verbalize all triples or that added information not present in the triples, which may have penalised factually correct generations. (We also trained and tested models on the cleaned E2E data; the full results, including the factual correctness evaluation, are shown in Table 8 in Appendix F and follow a similar trend relative to the Transformer as the results in Table 3.) This also shows that AggGen produces correct outputs even when trained on a noisy dataset. Since constructing high-quality data-to-text training sets is expensive and labor-intensive, this robustness to noise is important.

Model BLEU NIST MET R-L CIDEr
TGen 39.23 6.022 36.97 55.52 1.762
Transformer 38.57 5.756 35.92 55.45 1.668
AggGen 41.06 6.207 37.91 55.13 1.844
AggGen (rand. order) 38.24 5.951 36.56 51.53 1.653
AggGen (no agg.) 30.44 4.636 37.99 49.94 0.936
Table 3: Evaluation of Generation trained on the original E2E data and tested on the cleaned E2E data. Note that the clean test set has more diverse MRs and fewer references per MR, which leads to lower scores – see also the paper introducing the cleaned E2E data (Tables 2 and 3 in Dušek et al. (2019)).

5.2 Factual Correctness Results

The results for factual correctness evaluated using SER on the original E2E test set are shown in Table 2. The SER of AggGen is the best among all models. In particular, the high “Miss” scores for TGen and Transformer demonstrate the high chance of information omission in vanilla seq2seq-based generators. In contrast, AggGen shows much better coverage of the input triples while keeping a low level of hallucination (low “Add” and “Wrong” scores).

5.3 Ablation variants

To explore the effect of input planning on text generation, we introduced two model variants: AggGen (rand. order), where we replaced Input Ordering with random shuffling of the input triples before input aggregation, and AggGen (no agg.), where the Input Ordering result was passed directly to text generation and the text decoder generated a fact for each input triple individually.

The generation evaluation results on both datasets (Tables 1 and 2) show that the full AggGen substantially outperforms both AggGen (rand. order) and AggGen (no agg.), which means that both Input Ordering and Input Aggregation are critical. Table 2 shows that the factual correctness results for the ablative variants are much worse than for the full AggGen, indicating that planning is essential for factual correctness. An exception is the lower number of missed slots for AggGen (no agg.): this is expected, since this variant generates a textual fact for each triple individually, which decreases the possibility of omissions at the cost of much lower fluency. This strategy also leads to a steep increase in added information.

Additionally, AggGen (no agg.) performs even worse on the E2E data than on the WebNLG data. This result is also expected, since input aggregation is more pronounced in the E2E dataset, which has a higher number of facts and input triples per sentence (cf. Appendix C).

5.4 Qualitative Error Analysis

We manually examined a sample of 100 outputs (50 from each dataset) with respect to their factual correctness and fluency. For factual correctness, we follow the definition of SER and check whether there are hallucinations, substitutions or omissions in the generated texts. For fluency, we check whether the generated texts suffer from grammar mistakes or redundancy, or contain unfinished sentences. Figure 4 shows two examples of generated texts from Transformer and AggGen (more examples, including target texts generated by the ablation variants, are shown in Table 6 and Table 7 in Appendix A). We observe that, in general, the seq2seq Transformer model tends to compress more triples into one fluent fact, whereas AggGen aggregates triples into more but smaller groups and generates a shorter/simpler fact for each group. Therefore, the texts generated by Transformer are more compressed, while AggGen’s generations are longer with more sentences. However, the planning ensures that all input triples will still be mentioned. Thus, AggGen generates texts with higher factual correctness without trading off fluency (among the 100 examined examples, the number of fluent generations is 96 for the Transformer and 95 for AggGen; the two ablation variants score 86 and 74, which indicates that both Input Ordering and Input Aggregation are critical for generating fluent texts).

Figure 4: Examples of input and system-generated target text for E2E (top) and WebNLG (bottom). The sequences in square brackets are the sentence plans.

6 Intrinsic Evaluation of Planning

We now directly inspect the performance of the planning component by taking advantage of the readability of SRL-aligned facts. In particular, we investigate: (1) sentence planning performance – we study the agreement between the model’s plans and reference plans for the same set of input triples; (2) alignment performance – we use AggGen as an aligner and examine its ability to align segmented facts to the corresponding input triples. Since both studies require ground-truth triple-to-fact alignments, which are not part of the WebNLG and E2E data, we first introduce a human annotation process in Section 6.1.

6.1 Human-annotated Alignments

We asked crowd workers on Amazon Mechanical Turk to align input triples to their fact-based text snippets to derive a “reference plan” for each target text (the evaluation requires human annotations, since anchor-based automatic alignments are not accurate enough – 86% – for the reference plan annotation; see Table 5, “RB”, for details). Each worker was given a set of input triples and a corresponding reference text description, segmented into a sequence of facts. The workers were then asked to select the triples that are verbalised in each fact (the annotation guidelines and an example annotation task are shown in Figure 7 in Appendix G). We sampled 100 inputs from the WebNLG test set for annotation (we chose WebNLG over E2E for its domain and predicate diversity). Each input was paired with three reference target texts from WebNLG. To guarantee the correctness of the annotation, three different workers annotated each input-reference pair. We only consider the alignments where all three annotators agree. Using Fleiss’ Kappa Fleiss (1971) over the facts aligned by each judge to each triple, we obtained an average agreement of 0.767 for the 300 input-reference pairs, which is considered high agreement.

6.2 Study of Sentence Planning

We now check the agreement between the model-generated and reference plans based on the top-1 Input Aggregation result (see Section 3.4). We introduce two metrics:

Normalized Mutual Information (NMI) Strehl and Ghosh (2002) to evaluate aggregation. We represent each plan as a set of clusters of triples, where a cluster contains the triples sharing the same fact verbalization. Using NMI we measure the mutual information between two clusterings, normalized into the 0-1 range, where 0 and 1 denote no mutual information and perfect correlation, respectively.

Kendall’s tau (τ) Kendall (1945) is a ranking-based measure which we use to evaluate both ordering and aggregation. We represent each plan as a ranking of the input triples, where the rank of each triple is the position of its associated fact verbalization in the target text. τ measures rank correlation, ranging from -1 (strong disagreement) to 1 (strong agreement).
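Both metrics are available in standard libraries; the sketch below assumes each plan has been converted to a list of cluster ids (one per input triple, for NMI) and a list of ranks (the position of the triple's fact in the text, for τ). The specific numbers are made up for illustration.

```python
from scipy.stats import kendalltau
from sklearn.metrics import normalized_mutual_info_score

# Cluster ids: triples that share a fact verbalization get the same id.
reference_clusters = [0, 0, 1, 2]      # e.g. a reference plan over four triples
model_clusters = [0, 1, 1, 2]          # e.g. the model's plan over the same triples
nmi = normalized_mutual_info_score(reference_clusters, model_clusters)

# Ranks: position of each triple's fact verbalization in the target text.
reference_ranks = [1, 1, 2, 3]
model_ranks = [1, 2, 2, 3]
tau, _ = kendalltau(reference_ranks, model_ranks)

print(f"NMI = {nmi:.3f}, Kendall's tau = {tau:.3f}")
```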

Human 0.9340 0.7587 0.8415 0.2488
AggGen 0.7101 0.6247 0.6416 0.2064
Table 4: Planning Evaluation Results. NMI and Kendall’s τ (average and maximum values) calculated between human-written references (top row) and between the references and our system AggGen (bottom row).

In the crowdsourced annotation (Section 6.1), each set of input triples comes with three reference texts with annotated plans. We first evaluate the correspondence among these three reference plans by calculating NMI and τ between one plan and the remaining two. In the top row of Table 4, the high average and maximum NMI indicate that the reference texts’ authors tend to aggregate input triples in similar ways. On the other hand, the low average τ shows that they are likely to order the aggregated groups differently. Then, for each set of input triples, we measure NMI and τ of the top-1 Input Aggregation result (the model’s plan) against each of the corresponding reference plans and compute average and maximum values (bottom row in Table 4). Compared to the strong agreement among reference plans on the input aggregation, the agreement between the model’s and reference plans is slightly weaker. Our model has slightly lower agreement on aggregation (NMI), but if we consider aggregation and ordering jointly (τ), the agreement between our model’s plans and the reference plans is comparable to the agreement among the reference plans.

6.3 Study of Alignment

Precision Recall F1
RB (%) 86.20 100.00 92.59
Vtb (%) 89.73 84.16 86.85
Table 5: Alignment Evaluation Results. Alignment accuracy for the Viterbi algorithm (Vtb) and the rule-based aligner (RB).

In this study, we use the HMM model as an aligner and assess its ability to align input triples with their fact verbalizations on the human-annotated set. Given a sequence of observed variables, a trained HMM-based model is able to find the most likely sequence of hidden states using Viterbi decoding. Similarly, given a set of input triples and a text segmented into facts, we use Viterbi decoding with our model to align each fact with the corresponding input triple(s). We then evaluate the accuracy of the model-produced alignments against the crowdsourced alignments.
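The alignment step can be sketched as standard Viterbi decoding over predicate groups, mirroring the backward recursion of Section 3.3 (hypothetical log-probability helpers; `None` stands for the marked start-state): the decoded state sequence directly gives, for each fact, the group of input triples it is aligned to.

```python
from itertools import combinations

def viterbi_align(facts, input_predicates, trans_logp, emis_logp, max_group=3):
    """Most likely sequence of latent states (one predicate group per fact)."""
    states = [g for k in range(1, max_group + 1)
              for g in combinations(sorted(input_predicates), k)]
    delta = {g: trans_logp(None, g) + emis_logp(g, facts[0]) for g in states}
    backpointers = []
    for fact in facts[1:]:
        new_delta, pointers = {}, {}
        for g in states:
            best_prev = max(states, key=lambda gp: delta[gp] + trans_logp(gp, g))
            new_delta[g] = delta[best_prev] + trans_logp(best_prev, g) + emis_logp(g, fact)
            pointers[g] = best_prev
        delta, backpointers = new_delta, backpointers + [pointers]
    best = max(states, key=lambda g: delta[g])             # backtrack from the best final state
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```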

The alignment evaluation results are shown in Table 5. We compare the Viterbi (Vtb) alignments with those calculated by a rule-based aligner (RB) that aligns each triple to the fact with the greatest word overlap. The precision of the Viterbi aligner is higher than that of the rule-based aligner. However, the Viterbi aligner tends to miss triples, which leads to a lower recall. Since the HMM’s decisions are locally optimal, the model cannot guarantee that each input triple is aligned once and only once.

7 Conclusion and Future Work

We show that explicit sentence planning, i.e., input ordering and aggregation, helps substantially to produce output that is both semantically correct and natural-sounding. Crucially, this also enables us to directly evaluate and inspect both the model’s planning and alignment performance by comparing to manually aligned reference texts. Our system outperforms vanilla seq2seq models when considering semantic accuracy and word-overlap-based metrics. Experimental results also show that AggGen is robust to noisy training data. We plan to extend this work in three directions:

Other Generation Models. We plan to plug other text generators, e.g. pretraining-based approaches Lewis et al. (2020); Kale and Rastogi (2020), into AggGen to enhance their interpretability and controllability via sentence planning and generation.

Zero/Few-shot scenarios. Kale and Rastogi (2020)’s work on low-resource NLG uses a pre-trained language model with a schema-guided representation and hand-written templates to guide the representation in unseen domains and slots. These techniques can be plugged into AggGen, which allows us to examine the effectiveness of the explicit sentence planning in zero/few-shot scenarios.

Including Content Selection. In this work, we concentrate on the problem of faithful surface realization based on E2E and WebNLG data, which both operate under the assumption that all input predicates have to be realized in the output. In contrast, more challenging tasks such as RotoWire Wiseman et al. (2017) include content selection before sentence planning. In the future, we plan to include a content selection step to further extend AggGen’s usability.

Acknowledgments

This research received funding from the EPSRC project AISec (EP/T026952/1), Charles University project PRIMUS/19/SCI/10, a Royal Society research grant (RGS/R1/201482), a Carnegie Trust incentive grant (RIG009861), and an Apple NLU Research Grant to support research at Heriot-Watt University and Charles University. We thank Alessandro Suglia, Jindřich Helcl, and Henrique Ferrolho for suggestions on the draft. We thank the anonymous reviewers for their helpful comments.

References

  • Z. Cao, F. Wei, W. Li, and S. Li (2018) Faithful to the Original: Fact Aware Neural Abstractive Summarization. In AAAI, New Orleans, LA, USA. Note: arXiv: 1711.04434 External Links: Link Cited by: §1.
  • T. Castro Ferreira, C. van der Lee, E. van Miltenburg, and E. Krahmer (2019) Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong. External Links: Link Cited by: §2.
  • G. Doddington (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pp. 138–145. Cited by: §4.2.
  • P. A. Duboue and K. R. McKeown (2001) Empirically estimating order constraints for content planning in generation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp. 172–179. External Links: Link, Document Cited by: §1.
  • P. Duboue and K. McKeown (2002) Content planner construction via evolutionary algorithms and a corpus-based fitness function. In Proceedings of the International Natural Language Generation Conference, Harriman, New York, USA, pp. 89–96. External Links: Link Cited by: §1.
  • O. Dušek, D. M. Howcroft, and V. Rieser (2019) Semantic noise matters for neural natural language generation. In Proc. of the 12th International Conference on Natural Language Generation, Tokyo, Japan, pp. 421–426. External Links: Link, Document, 1911.03905 Cited by: Appendix C, §5.1, Table 3.
  • O. Dušek and F. Jurčíček (2016a) Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, pp. 45–51. External Links: Link Cited by: §2.
  • O. Dušek and F. Jurčíček (2016b) Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 45–51. External Links: Link, Document Cited by: §4.3.
  • O. Dušek, J. Novikova, and V. Rieser (2020) Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Computer Speech & Language 59, pp. 123–156. External Links: Document, 1901.11528 Cited by: §1, §4.2.
  • H. Elder, J. Foster, J. Barry, and A. O’Connor (2019) Designing a symbolic intermediate representation for neural surface realization. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, Minneapolis, Minnesota, pp. 65–73. External Links: Link, Document Cited by: §2.
  • A. Fan, M. Lewis, and Y. Dauphin (2019) Strategies for Structuring Story Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2650–2660. External Links: Document Cited by: §2.
  • J. L. Fleiss (1971) Measuring nominal scale agreement among many raters.. Psychological bulletin 76 (5), pp. 378. Cited by: §6.1.
  • H. Gao, L. Wu, P. Hu, and F. Xu (2020) RDF-to-Text Generation with Graph-augmented Structural Neural Encoders. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, pp. 3030–3036. External Links: ISBN 978-0-9992411-6-5, Link, Document Cited by: §2.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017) Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 179–188. External Links: Link, Document Cited by: §1, §4.1.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017) The WebNLG Challenge: Generating Text from RDF Data. In Proceedings of the 10th International Conference on Natural Language Generation, Santiago de Compostela, Spain, pp. 124–133. External Links: Link Cited by: §4.3, §5.1.
  • A. Gatt and E. Krahmer (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61, pp. 65–170. Cited by: §1.
  • L. He, K. Lee, O. Levy, and L. Zettlemoyer (2018) Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 364–369. External Links: Link Cited by: Appendix B.
  • J. Juraska, P. Karagiannis, K. K. Bowden, and M. A. Walker (2018) A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, pp. 152–162. External Links: Link Cited by: §2.
  • M. Kale and A. Rastogi (2020) Text-to-text pre-training for data-to-text tasks. In Proceedings of the 13th International Conference on Natural Language Generation, Dublin, Ireland, pp. 97–102. External Links: Link Cited by: §2, §4.3, §7, §7.
  • C. Kedzie and K. McKeown (2019) A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models. In INLG, Tokyo, Japan. External Links: Link Cited by: §2.
  • M. G. Kendall (1945) The treatment of ties in ranking problems. Biometrika, pp. 239–251. Cited by: §6.2.
  • D. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA. Note: arXiv: 1412.6980 External Links: Link Cited by: Appendix D.
  • I. Konstas and M. Lapata (2013) Inducing document plans for concept-to-text generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1503–1514. External Links: Link Cited by: §1.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the second workshop on statistical machine translation, pp. 228–231. External Links: Link Cited by: §4.2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: §7.
  • X. L. Li and A. Rush (2020) Posterior control of blackbox generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2731–2743. External Links: Link, Document Cited by: §2.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.2.
  • R. Logan, N. F. Liu, M. E. Peters, M. Gardner, and S. Singh (2019) Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5962–5971. External Links: Link, Document Cited by: §2.
  • D. Marcheggiani and L. Perez-Beltrachini (2018) Deep Graph Convolutional Encoders for Structured Data to Text Generation. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg University, The Netherlands, pp. 1–9. External Links: Link Cited by: §2.
  • A. Moryossef, I. Dagan, and Y. Goldberg (2019) Improving Quality and Efficiency in Plan-based Neural Data-to-Text Generation. In INLG, Tokyo, Japan. External Links: Link Cited by: §2.
  • A. Moryossef, Y. Goldberg, and I. Dagan (2019) Step-by-Step: Separating Planning from Realization in Neural Data-to-Text Generation. In NAACL, Minneapolis, MN, USA. External Links: Link Cited by: §1, §2.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §3.3.
  • F. Nie, J. Yao, J. Wang, R. Pan, and C. Lin (2019) A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. External Links: Link Cited by: §2.
  • J. Novikova, O. Dušek, and V. Rieser (2017) The E2E dataset: new challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 201–206. External Links: Link, Document Cited by: §1, §4.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Link, Document Cited by: §4.2.
  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii. External Links: Link Cited by: §2.
  • R. Qader, F. Portet, and C. Labbe (2019) Semi-Supervised Neural Text Generation by Joint Learning of Natural Language Generation and Natural Language Understanding Models. In INLG, Tokyo, Japan. External Links: Link Cited by: §2.
  • L. R. Rabiner (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257–286. Cited by: §3.3.
  • J. Rao, K. Upasani, A. Balakrishnan, M. White, A. Kumar, and R. Subba (2019) A Tree-to-Sequence Model for Neural NLG in Task-Oriented Dialog. In INLG, Tokyo, Japan. External Links: Link Cited by: §2.
  • L. Reed, S. Oraby, and M. Walker (2018) Can neural generators for dialogue learn sentence planning and discourse structuring?. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg University, The Netherlands, pp. 284–295. External Links: Link, Document Cited by: §4.2.
  • E. Reiter and R. Dale (2000) Building natural language generation systems. Cambridge university press. Cited by: §1.
  • A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018) Object Hallucination in Image Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4035–4045. External Links: Link Cited by: §1.
  • Z. Shao, M. Huang, J. Wen, W. Xu, and X. Zhu (2019) Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3257–3268. External Links: Link, Document Cited by: §1, §2.
  • X. Shen, E. Chang, H. Su, J. Zhou, and D. Klakow (2020) Neural data-to-text generation via jointly learning the segmentation and correspondence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, External Links: Document Cited by: §1, §2, §4.3, §5.1, Table 2.
  • M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006) A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, Vol. 200. Cited by: §4.2.
  • A. Strehl and J. Ghosh (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research 3 (Dec), pp. 583–617. Cited by: §6.2.
  • S. Su, K. Lo, Y. Yeh, and Y. Chen (2018) Natural language generation by hierarchical decoding with linguistic patterns. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 61–66. External Links: Link, Document Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2, §4.3.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §4.2.
  • T. Wen, M. Gasic, D. Kim, N. Mrksic, P. Su, D. Vandyke, and S. Young (2015a) Stochastic Language Generation in Dialogue using Recurrent Neural Networks with Convolutional Sentence Reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, pp. 275–284. External Links: Link Cited by: §2.
  • T. Wen, M. Gašić, N. Mrkšić, P. Su, D. Vandyke, and S. Young (2015b) Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1711–1721. External Links: Link, Document Cited by: §4.2.
  • S. Wiseman, S. Shieber, and A. Rush (2017) Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2253–2263. External Links: Link, Document Cited by: §7.
  • S. Wiseman, S. Shieber, and A. Rush (2018) Learning neural templates for text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3174–3187. External Links: Link, Document Cited by: §1, §2, §4.3.
  • X. Xu, O. Dušek, J. Li, V. Rieser, and I. Konstas (2020) Fact-based content weighting for evaluating abstractive summarisation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5071–5081. Cited by: Appendix B, §3.1.
  • C. Zhao, M. Walker, and S. Chaturvedi (2020) Bridging the Structural Gap Between Encoding and Decoding for Data-To-Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2481–2491. External Links: Link, Document Cited by: §1, §2, §4.3.

Appendix A Examples of input and system-generated target text

Input the cricketers [customerrating] high
the cricketers [eattype] restaurant
the cricketers [near] express by holiday inn
Transf Predicted Text: the cricketers is a highly rated restaurant near express by holiday inn.
AggGen Sentence Planning: [eatType] [customerrating near]
Predicted Text: the cricketers is a highly rated restaurant located near express by holiday inn.
AggGen (no agg.) Sentence Planning: [eatType] [near] [customerrating]
Predicted Text: the cricketers is a restaurant located near express by holiday inn . it has a high customer rating.
AggGen (rand. order) Sentence Planning: [eatType near] [customerrating]
Predicted Text: the cricketers is a restaurant near express by holiday inn . it has a high customer rating.
Input blue spice [eattype] pub
blue spice [familyfriendly] yes
blue spice [food] english
blue spice [near] rainbow vegetarian café
Transf Predicted Text: the blue spice is a family friendly pub near the rainbow vegetarian café. it serves english food in the higher price range.
AggGen Sentence Planning: [eatType familyFriendly] [food near]
Predicted Text: blue spice is a family friendly pub serving english food near the rainbow vegetarian café.
AggGen (no agg.) Sentence Planning: [eatType] [familyFriendly] [food] [near]
Predicted Text: blue spice is a mid - priced pub that is family friendly and serves british food . it is located near the rainbow vegetarian café.
AggGen (rand. order) Sentence Planning: [near familyFriendly eatType] [food]
Predicted Text: blue spice is a family friendly pub near the rainbow vegetarian café . it serves english food.
Input the cricketers [area] riverside
the cricketers [customerrating] 5 out of 5
the cricketers [eattype] restaurant
the cricketers [familyfriendly] yes
the cricketers [food] chinese
the cricketers [near] all bar one
Transf Predicted Text: the cricketers is a family friendly chinese restaurant near all bar one in the riverside area . it has a customer rating of 5 out of 5.
AggGen Sentence Planning: [eatType familyFriendly] [food area near] [customerrating]
Predicted Text: the cricketers is a family friendly chinese restaurant located in the riverside area near all bar one . it has a customer rating of 5 out of 5.
AggGen (no agg.) Sentence Planning: [eatType] [food] [area] [customerrating] [familyFriendly] [near]
Predicted Text: the cricketers is a restaurant that serves chinese food. it is located in the riverside area and has a customer rating of 5 out of 5 and is children friendly. the coffee shop is near all bar one.
AggGen (rand. order) Sentence Planning: [customerrating] [familyFriendly eatType] [food near area]
Predicted Text: the cricketers is a 5 out of 5 rated , family friendly chinese restaurant located in riverside near all bar one.
Input the cricketers [area] city centre
the cricketers [customerrating] average
the cricketers [eattype] restaurant
the cricketers [familyfriendly] yes
the cricketers [food] chinese
the cricketers [near] all bar one
the cricketers [pricerange] high
Transf Predicted Text: the cricketers is a chinese restaurant near all bar one in the city centre . it is children friendly and has an average customer rating.
AggGen Sentence Planning: [eatType priceRange] [food customerrating] [familyFriendly area near]
Predicted Text: the cricketers is a chinese restaurant with a high price range . it has an average customer rating and is children friendly near all bar one in the city centre.
AggGen (no agg.) Sentence Planning: [eatType] [priceRange] [food] [customerrating] [familyFriendly] [area] [near]
Predicted Text: the cricketers is a chinese restaurant that is children friendly, has a high price range, a customer rating of 3 out of 5, is located near the portland arms and is in the city centre. it has an average customer rating. is also a children-friendly . you can find it is called the cricketers.
AggGen (rand. order) Sentence Planning: [area] [familyFriendly priceRange] [eatType food near] [customerrating]
Predicted Text: located in the city centre, the cricketers is a child friendly chinese restaurant with a high price range. it is near all bar one. its customer rating is average.
Table 6: Examples of input and system-generated target text for E2E NLG dataset.
Input 11th mississippi infantry monument [category] contributing property
11th mississippi infantry monument [country] united states
11th mississippi infantry monument [established] 2000
11th mississippi infantry monument [location] seminary ridge
11th mississippi infantry monument [municipality] gettysburg, pennsylvania
Transf Predicted Text: the 11th mississippi infantry monument is located at seminary ridge, gettysburg, pennsylvania, united states. it was established in 2000 and falls under the category of contributing property.
AggGen Sentence Planning: [location country established] [category] [municipality]
Predicted Text: the 11th mississippi infantry monument, established in 2000, is located at seminary ridge, gettysburg, pennsylvania, united states. it is categorized as a contributing property and is in the municipality of gettysburg pennsylvania.
AggGen Sentence Planning: [country] [established] [category] [municipality] [location]
Predicted Text: the 11th mississippi infantry monument is located in gettysburg, pennsylvania, united states. the monument was established in 2000 and falls under the category of contributing property. it can be found in the municipality of gettysburg in pennsylvania and is in the seminary ridge.
AggGen Sentence Planning: [municipality] [category] [country established] [location]
Predicted Text: the 11th mississippi infantry monument is in the municipality of gettysburg, pennsylvania. it is categorized as a contributing property. the monument was established in the year 2000 in the united states and is located at seminary ridge.
Input apollo 8 [backup_pilot] buzz aldrin
apollo 8 [crewmembers] frank borman
apollo 8 [operator] nasa
william anders [birthplace] british hong kong
william anders [dateofretirement] 1969-09-01
william anders [was_a_crew_member_of] apollo 8
Transf Predicted Text: william anders (born in british hong kong) was a crew member of nasa’s apollo 8 alongside frank borman. william anders retired on september 1st, 1969.
AggGen Sentence Planning: [birthPlace] [was_a_crew_member_of] [operator crewMembers] [backup_pilot] [dateOfRetirement]
Predicted Text: william anders was born in british hong kong and served as a crew member on apollo 8. frank borman was a crewman aboard the nasa operated apollo 8 mission. the backup pilot was buzz aldrin. william anders retired on september 1st, 1969.
AggGen Sentence Planning: [birthPlace] [was_a_crew_member_of] [operator] [crewMembers] [backup_pilot] [dateOfRetirement]
Predicted Text: william anders (born in british hong kong) was a crew member of apollo 8 operated by nasa, where buzz aldrin was a backup pilot. frank borman was also a crewman aboard apollo 8. buzz aldrin’s backup pilot was buzz aldrin. william anders retired on september 1st, 1969.
AggGen Sentence Planning: [backup_pilot] [was_a_crew_member_of] [birthPlace] [dateOfRetirement operator] [crewMembers]
Predicted Text: buzz aldrin was a crew member of apollo 8 with buzz aldrin as backup pilot and frank borman as buzz aldrin. william anders was a crewman aboard apollo 8. british hong kong born william anders retired. he was a backup pilot on the apollo 8 mission was operated by nasa. frank borman was also a crewman.
Input school of business and social sciences at the aarhus university [affiliation] european university association
school of business and social sciences at the aarhus university [city] aarhus
school of business and social sciences at the aarhus university [country] denmark
school of business and social sciences at the aarhus university [dean] thomas pallesen
school of business and social sciences at the aarhus university [established] 1928
school of business and social sciences at the aarhus university [latinname] universitas aarhusiensis
school of business and social sciences at the aarhus university [numberofstudents] 16000
Transf Predicted Text: the school of business and social sciences at the aarhus university in aarhus, denmark was established in 1928. it has 737 academic staff and 16,000 students. its dean is thomas pallesen and it is affiliated with the european university association.
AggGen Sentence Planning: [country established city] [affiliation latinName numberOfStudents] [dean]
Predicted Text: the school of business and social sciences at the aarhus university in aarhus, denmark was established in 1928. it has 737 academic staff and 16,000 students. its latin name is ”universitas aarhusiensis” and it is affiliated with the european university association. its dean is thomas pallesen.
AggGen Sentence Planning: [country] [established] [city] [affiliation] [latinName] [numberOfStudents] [dean]
Predicted Text: the school of business and social sciences at the aarhus university in denmark is affiliated with the european university association in brussels. the school was established in 1928 and is located in the city of aarhus. it is affiliated to the european univerrsity in the school the latin name of ”universitas aarhusiensis”. and there are 16000 students. its dean is thomas pallesen”.
AggGen Sentence Planning: [latinName numberOfStudents] [country established] [dean affiliation] [city]
Predicted Text: the school of business and social sciences at the aarhus university in aarhus, denmark has 16,000 students and its latin name is ”universitas aarhusiensis”. it was established in 1928 in the country of denmark and it is affiliated with the european university association. the current dean is thomas pallesen and it has strong connections with the school also affiliated to its dean of business. and it also home to the school.
Table 7: Examples of input and system-generated target texts for the WebNLG dataset.

Appendix B Factoid Sentence Segmentation

In order to align meaningful parts of the human-written target text to semantic triples, we first segment the target sentences into sequences of facts using SRL, following Xu et al. (2020). The aim is to break down each sentence into sub-sentences (facts) that verbalize as few input triples as possible, such that the original sentence can still be fully recovered by concatenating all its sub-sentences. Each fact is represented by a segment of the original text that roughly captures “who did what to whom” in one event. We first parse the sentences into SRL propositions using the implementation of He et al. (2018) (available at https://allennlp.org; 86.49 test F1 on the OntoNotes 5.0 dataset). We consider each predicate-argument structure as a separate fact, where the predicate stands for the event and its arguments are mapped to actors, recipients, time, place, etc. (see Figure 5). The sentence segmentation consists of two consecutive steps:

(1) Tree Construction, where we construct a hierarchical tree structure over all the facts of one sentence by choosing the fact with the largest coverage as the root and recursively building sub-trees by replacing arguments with their corresponding sub-facts (e.g., ARG1 in FACT1 is replaced by FACT2).

(2) Argument Grouping, where each predicate (FACT in the tree) together with its leaf-arguments corresponds to one sub-sentence. For example, in Figure 5, the leaf-arguments “was” and “a crew member on Apollo 8” of FACT1 are grouped as one sub-sentence.
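The following is a minimal Python sketch of these two steps. It is illustrative rather than the released implementation: it assumes the SRL propositions have already been converted into character-offset spans, and the Fact class, span encoding, and offsets in the example are our own simplification.

```python
# Illustrative sketch of the two-step factoid segmentation.
# Assumption: SRL propositions (e.g., from the AllenNLP SRL parser)
# have already been converted into character-offset spans.
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end-exclusive


@dataclass
class Fact:
    predicate: Span          # span of the predicate
    arguments: List[Span]    # spans of its SRL arguments
    children: List["Fact"] = field(default_factory=list)

    def coverage(self) -> Span:
        spans = [self.predicate] + self.arguments
        return (min(s for s, _ in spans), max(e for _, e in spans))


def contains(outer: Span, inner: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def build_tree(facts: List[Fact]) -> Fact:
    """Step (1): the fact with the largest coverage becomes the root;
    every other fact is attached below the deepest fact that covers it."""
    facts = sorted(facts, key=lambda f: f.coverage()[1] - f.coverage()[0],
                   reverse=True)
    root, rest = facts[0], facts[1:]
    for fact in rest:
        parent, moved = root, True
        while moved:
            moved = False
            for child in parent.children:
                if contains(child.coverage(), fact.coverage()):
                    parent, moved = child, True
                    break
        parent.children.append(fact)
    return root


def group_subsentences(fact: Fact, text: str) -> List[str]:
    """Step (2): each predicate plus its leaf arguments (arguments that were
    not replaced by a sub-fact) forms one sub-sentence."""
    child_cov = [c.coverage() for c in fact.children]
    leaf_args = [a for a in fact.arguments
                 if not any(contains(a, cov) for cov in child_cov)]
    spans = sorted([fact.predicate] + leaf_args)
    sub = " ".join(text[s:e] for s, e in spans)
    return [sub] + [s for c in fact.children for s in group_subsentences(c, text)]


if __name__ == "__main__":
    # The sentence from Figure 5 (offsets are illustrative).
    text = ("William Anders, who retired on September 1st, 1969, "
            "was a crew member on Apollo 8.")
    fact1 = Fact(predicate=(52, 55), arguments=[(0, 50), (56, 81)])            # "was ..."
    fact2 = Fact(predicate=(20, 27), arguments=[(0, 14), (16, 19), (28, 50)])  # "retired ..."
    print(group_subsentences(build_tree([fact1, fact2]), text))
    # -> ['was a crew member on Apollo 8',
    #     'William Anders who retired on September 1st, 1969']
```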

Figure 5: Semantic Role Labeling based tree meaning representation and factoid sentence segmentation for text “William Anders, who retired on September 1st, 1969, was a crew member on Apollo 8.”

Appendix C Datasets

WebNLG.

The corpus contains 21K instances (input-text pairs) from 9 different domains (e.g., astronauts, sports teams). The number of input triples ranges from 1 to 7, with an average of 2.9. The average number of facts per target text is 2.4 (see Appendix B). The corpus contains 272 distinct predicates. The vocabulary sizes for the input and output side are 2.6K and 5K, respectively.

E2E NLG.

The corpus contains 50K instances from the restaurant domain. We automatically convert the original attribute-value pairs into triples: for each instance, we take the restaurant name as the subject and treat each remaining attribute-value pair as the corresponding predicate and object. The number of triples per input ranges from 1 to 7, with an average of 4.4. The average number of facts per target text is 2.6. The corpus contains 9 distinct predicates. The vocabulary sizes for the input and output side are 120 and 2.4K, respectively. We also tested our approach on the updated, cleaned release of Dušek et al. (2019).
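As an illustration only (the function name below is ours, not from the released code; "name" is the attribute holding the restaurant name in the E2E data), the conversion can be sketched as:

```python
# Hypothetical sketch of the attribute-value -> triple conversion described above.
def mr_to_triples(mr: dict) -> list:
    """Convert an E2E meaning representation such as
    {"name": "Blue Spice", "eatType": "pub", "food": "English"}
    into (subject, predicate, object) triples sharing the restaurant
    name as subject."""
    subject = mr["name"]
    return [(subject, attr, value) for attr, value in mr.items() if attr != "name"]


print(mr_to_triples({"name": "Blue Spice", "eatType": "pub", "food": "English"}))
# -> [('Blue Spice', 'eatType', 'pub'), ('Blue Spice', 'food', 'English')]
```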

Appendix D Hyperparameters

WebNLG.

Both the encoder and the decoder are 2-layer, 4-head Transformers with a hidden dimension of 256. The sizes of the token embeddings and predicate embeddings are 256 and 128, respectively. The Adam optimizer Kingma and Ba (2015) is used to update parameters. For both the baseline model and the pre-training of the HMM-based model, the learning rate is 0.1. During training of the HMM-based model, the learning rates for encoder-decoder fine-tuning and for the transition distributions are set to 0.002 and 0.01, respectively.

E2E.

Both the encoder and the decoder are Transformers with a hidden dimension of 128. The sizes of the token embeddings and predicate embeddings are 128 and 32, respectively. The remaining hyperparameters are the same as for WebNLG.
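For quick reference, the reported settings can be collected into a single configuration sketch; the key names below are illustrative and do not correspond to the option names in the released code.

```python
# Hedged summary of the hyperparameters reported above; keys are illustrative.
HPARAMS = {
    "webnlg": {
        "enc_layers": 2, "dec_layers": 2, "attention_heads": 4,
        "hidden_dim": 256, "token_emb_dim": 256, "predicate_emb_dim": 128,
        "optimizer": "adam",
        "lr_baseline_and_pretrain": 0.1,  # baseline / pre-training of the HMM-based model
        "lr_encdec_finetune": 0.002,      # encoder-decoder fine-tuning
        "lr_transitions": 0.01,           # transition distributions
    },
    "e2e": {
        "hidden_dim": 128, "token_emb_dim": 128, "predicate_emb_dim": 32,
        # all remaining values as for WebNLG
    },
}
```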

Appendix E Parameterization: Emission Distribution

Figure 6: The Transformer encoder takes linearized triples and produces contextual embeddings. We assume that, at time step $t$, the Transformer decoder is generating the fact $y_t$ conditioned on the latent state $z_t$. The number of predicates in the latent state is 1, i.e., $|z_t| = 1$. If the value of $z_t$ is the predicate of the first triple (solid borders), then the second triple (dashed borders) is masked out in the encoder-decoder attention during decoding.
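The masking described in the caption can be sketched as follows. This is a simplified stand-in for the actual implementation: it assumes PyTorch and that each linearized input token is tagged with the predicate of the triple it came from.

```python
import torch

# Illustrative sketch: build an encoder-decoder attention mask so that, while
# emitting the fact for the current latent state, the decoder only attends to
# tokens of triples whose predicate belongs to that state.
def emission_attention_mask(token_predicates, current_state):
    """token_predicates: predicate owning each linearized input token;
    current_state: set of predicates in the latent state at this time step.
    Returns a bool mask of shape (1, src_len); True = may attend."""
    keep = torch.tensor([p in current_state for p in token_predicates])
    return keep.unsqueeze(0)


# Two linearized triples, six tokens; only the first triple's predicate is in
# the current state, so the second triple is masked out.
tokens = ["eatType"] * 3 + ["food"] * 3
print(emission_attention_mask(tokens, {"eatType"}))
# -> tensor([[ True,  True,  True, False, False, False]])
```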

Appendix F Full Experiment Results on E2E

Model | Train | Test | BLEU | NIST | METEOR | ROUGE-L | CIDEr | Add | Miss | Wrong | SER
TGen | Original | Clean | 39.23 | 6.0217 | 36.97 | 55.52 | 1.7623 | 0.40 | 3.59 | 0.07 | 4.05
Transformer | Original | Clean | 38.57 | 5.7555 | 35.92 | 55.45 | 1.6676 | 2.13 | 5.71 | 0.51 | 8.35
AggGen | Original | Clean | 41.06 | 6.2068 | 37.91 | 55.13 | 1.8443 | 2.04 | 3.38 | 0.64 | 6.06
AggGen | Original | Clean | 38.24 | 5.9509 | 36.56 | 51.53 | 1.6525 | 2.94 | 3.67 | 2.18 | 8.80
AggGen | Original | Clean | 30.44 | 4.6355 | 37.99 | 49.94 | 0.9359 | 8.71 | 1.60 | 0.87 | 11.24
TGen | Clean | Clean | 40.73 | 6.1711 | 37.76 | 56.09 | 1.8518 | 0.07 | 0.72 | 0.08 | 0.87
Transformer | Clean | Clean | 38.62 | 6.0804 | 36.03 | 54.82 | 1.7544 | 3.15 | 4.56 | 1.32 | 9.02
AggGen | Clean | Clean | 39.88 | 6.1704 | 37.35 | 54.03 | 1.8193 | 1.10 | 1.85 | 1.25 | 4.21
AggGen | Clean | Clean | 38.28 | 6.0027 | 36.94 | 51.55 | 1.6397 | 1.74 | 2.74 | 0.62 | 5.11
AggGen | Clean | Clean | 26.92 | 4.2877 | 36.60 | 47.95 | 0.9205 | 5.99 | 1.54 | 2.31 | 9.98
Table 8: Evaluation of generation quality (middle columns) and factual correctness (right columns) on the E2E NLG data (see Section 5 for a description of the metrics).

Appendix G Annotation Interface

Figure 7: An example of the fact-triple alignment task (highlights correspond to facts).