Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence

05/03/2020 ∙ by Xiaoyu Shen, et al. ∙ Max Planck Society

The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from the problems of information missing, repetition and "hallucination". Due to the black-box nature of the neural attention architecture, avoiding these problems in a systematic way is non-trivial. To address this concern, we propose to explicitly segment target text into fragment units and align them with their data correspondences. The segmentation and correspondence are jointly learned as latent variables without any human annotations. We further impose a soft statistical constraint to regularize the segmental granularity. The resulting architecture maintains the same expressive power as neural attention models, while being able to generate fully interpretable outputs at several times lower computational cost. On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts.




1 Introduction

Data-to-text generation aims at automatically producing natural language descriptions of structured databases Reiter and Dale (1997). Traditional statistical methods usually tackle this problem by breaking the generation process into a set of local decisions that are learned separately Belz (2008); Angeli et al. (2010); Kim and Mooney (2010); Oya et al. (2014). Recently, neural attention models conflate all steps into a single end-to-end system and largely simplify the training process Mei et al. (2016); Lebret et al. (2016); Shen et al. (2017); Su et al. (2018, 2019); Chang et al. (2020). However, the black-box conflation also renders the generation uninterpretable and hard to control Wiseman et al. (2018); Shen et al. (2019a). Verifying the generation correctness in a principled way is non-trivial. In practice, it often suffers from the problems of information missing, repetition and “hallucination” Dušek et al. (2018, 2020).

Source data: Name[Clowns], PriceRange[more than £30], EatType[pub], FamilyFriendly[no]
Generation: 1⃝ Name 2⃝ (Clowns) 3⃝ FamilyFriendly 4⃝ (is a child-free) 5⃝ PriceRange 6⃝ (, expensive) 7⃝ EatType 8⃝ (pub.)

Figure 1: Generation from our model on the E2E dataset. Decoding is performed segment-by-segment. Each segment realizes one data record. 1⃝~8⃝ mark the decision order in the generation process.

In this work, we propose to explicitly exploit the segmental structure of text. Specifically, we assume the target text is formed from a sequence of segments. Every segment is the result of a two-stage decision: (1) Select a proper data record to be described and (2) Generate corresponding text by paying attention only to the selected data record. This decision is repeated until all desired records have been realized. Figure 1 illustrates this process.

Compared with neural attention, the proposed model has the following advantages: (1) We can monitor the corresponding data record for every segment to be generated. This allows us to easily control the output structure and verify its correctness (for example, we can perform constrained decoding similar to Balakrishnan et al. (2019) to rule out outputs with undesired patterns). (2) Explicitly building the correspondence between segments and data records can potentially reduce hallucination, as noted in Wu et al. (2018); Deng et al. (2018), where hard alignment usually outperforms soft attention. (3) When decoding each segment, the model pays attention only to the selected data record instead of averaging over the entire input data. This largely reduces the memory and computational costs. (Coarse-to-fine attention Ling and Rush (2017); Deng et al. (2017) was proposed for the same motivation, but it resorts to reinforcement learning, which is hard to train, and sacrifices performance for efficiency.)


To train the model, we do not rely on any human annotations for the segmentation and correspondence, but rather marginalize over all possibilities to maximize the likelihood of the target text, which can be efficiently done within polynomial time by dynamic programming. This is essentially similar to traditional methods of inducing segmentation and alignment with semi-markov models Daumé III and Marcu (2005); Liang et al. (2009). However, they make strong independence assumptions and thus perform poorly as generative models Angeli et al. (2010). In contrast, the transition and generation in our model condition on all previously generated text. By integrating an autoregressive neural network structure, our model is able to capture unbounded dependencies while still permitting tractable inference. The training process is stable as it does not require any sampling-based approximations. We further add a soft statistical constraint to control the segmentation granularity via posterior regularization Ganchev et al. (2010). On both the E2E and WebNLG benchmarks, our model is able to produce significantly higher-quality outputs while being several times computationally cheaper. Due to its fully interpretable segmental structure, it can be easily reconciled with heuristic rules or hand-engineered constraints to control the outputs (code to be released soon).

2 Related Work

Data-to-text generation is traditionally dealt with using a pipeline structure containing content planning, sentence planning and linguistic realization Reiter and Dale (1997). Each target text is split into meaningful fragments and aligned with corresponding data records, either by hand-engineered rules Kukich (1983); McKeown (1992) or statistical induction Liang et al. (2009); Koncel-Kedziorski et al. (2014); Qin et al. (2018). The segmentation and alignment are used as supervision signals to train the content and sentence planner Barzilay and Lapata (2005); Angeli et al. (2010). The linguistic realization is usually implemented by template mining from the training corpus Kondadadi et al. (2013); Oya et al. (2014). Our model adopts a similar pipeline generative process, but integrates all the sub-steps into a single end-to-end trainable neural architecture. It can be considered as a neural extension of the PCFG system in Konstas and Lapata (2013), with a more powerful transition probability considering inter-segment dependence and a state-of-the-art attention-based language model as the linguistic realizer.

Wiseman et al. (2018) tried a similar neural generative model to induce templates. However, their model only captures loose data-text correspondence and adopts a weak markov assumption for the segment transition probability. Therefore, it underperforms the neural attention baseline in generation quality. Our model is also in spirit related to recent attempts at separating content planning and surface realization in neural data-to-text models Zhao et al. (2018); Puduppully et al. (2019); Moryossef et al. (2019); Ferreira et al. (2019). Nonetheless, all of them resort to manual annotations or hand-engineered rules applicable only to a narrow domain. Our model, instead, automatically learns the optimal content planning via exploring exponentially many segmentation/correspondence possibilities.

There have been quite a few neural alignment models applied to tasks like machine translation Wang et al. (2018); Deng et al. (2018), character transduction Wu et al. (2018); Shankar and Sarawagi (2019) and summarization Yu et al. (2016); Shen et al. (2019b). Unlike word-to-word alignment, we focus on learning the alignment between data records and text segments. Some works also integrate neural language models to jointly learn the segmentation and correspondence, e.g., phrase-based machine translation Huang et al. (2018), speech recognition Wang et al. (2017) and vision-grounded word segmentation Kawakami et al. (2019). Data-to-text naturally fits into this scenario since each data record is normally verbalized in one continuous text segment.

3 Background: Data-to-Text

Let (x, y) denote a source-target pair. x is structured data containing a set of records, and y is a text description of x. The goal of data-to-text generation is to learn a distribution p(y|x) to automatically generate proper text describing the content of the data.

The neural attention architecture handles this task with an encode-attend-decode process Bahdanau et al. (2015). The input is processed into a sequence of tokens x_1, ..., x_N, normally by flattening the data records Wiseman et al. (2017). The encoder encodes each x_i into a vector h_i. At each time step t, the decoder attends to the encoded vectors and outputs the probability of the next token by p(y_t | y_{<t}, c_t). The context vector c_t is a weighted average of the source vectors:

    c_t = Σ_i α_{t,i} h_i,   α_{t,i} = softmax_i f(d_t, h_i)    (1)

d_t is the hidden state of the decoder at time step t. f is a score function to compute the similarity between d_t and h_i Luong et al. (2015).
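As a concrete illustration, the attend step of Equation 1 can be sketched in a few lines of Python (a minimal sketch using a dot-product score as one common choice for f; the function and variable names are ours, not from the paper):

```python
import math

def softmax(scores):
    # numerically stable softmax
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_context(d_t, H):
    """Return (c_t, alphas): the context vector as a weighted average of
    encoder vectors H, with dot-product scores f(d_t, h_i)."""
    scores = [sum(a * b for a, b in zip(d_t, h)) for h in H]
    alphas = softmax(scores)
    dim = len(H[0])
    c_t = [sum(alphas[i] * H[i][k] for i in range(len(H))) for k in range(dim)]
    return c_t, alphas
```

In a trained model the score would involve learned parameters; the dot product here only illustrates the weighted-average structure.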

4 Approach

Suppose the input data x contains a set of records {r_1, ..., r_K}. Our assumption is that the target text y can be segmented into a sequence of fragments. Each fragment corresponds to one data record. As the ground-truth segmentation and correspondence are not available, we need to enumerate over all possibilities to compute the likelihood of y. Denote by S(y) the set containing all valid segmentations of y. For any valid segmentation s = (s_1, ..., s_{|s|}) ∈ S(y), π(s) = y, where π means concatenation and |s| is the number of segments. For example, let y = [a, b, c, d]. One possible segmentation would be s = ([a, b, $], [c, d, $]). $ is the end-of-segment symbol and is removed when applying the π operator. We further define z_j to be the corresponding data record of segment s_j. The likelihood of each text is then computed by enumerating over all possibilities of s and z:

    p(y|x) = Σ_{s ∈ S(y)} Σ_z Π_{j=1}^{|s|} p(z_j | s_{<j}, z_{<j}) p(s_j | s_{<j}, z_{≤j})    (2)

Every segment is generated by first selecting the data record z_j based on the transition probability p(z_j | s_{<j}, z_{<j}), then generating tokens based on the word generation probability p(s_j | s_{<j}, z_{≤j}). Figure 2 illustrates the generation process of our model.

Figure 2: Generation process of our approach. The end-of-segment symbol $ is ignored when updating the state of the decoder. Solid arrows indicate the transition model and dashed arrows indicate the generation model. Every segment s_j is generated by attending only to the corresponding data record z_j.

Generation Probability

We base the generation probability on the same decoder as in neural attention models. The only difference is that the model can only pay attention to its corresponding data record. The attention scores of other records are masked out when decoding s_j:

    α_{t,i} ∝ exp(f(d_t, h_i)) · 1(x_i ∈ z_j)

where 1(·) is the indicator function. This forces the model to learn proper correspondences and enhances the connection between each segment and the data record it describes.
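The masking can be sketched as follows (a minimal illustration, assuming each encoder position is tagged with the id of the record it came from; the names are ours):

```python
import math

def masked_attention(d_t, H, record_ids, selected):
    """Attention weights over encoder vectors H, where positions whose
    record id differs from the selected record are masked to -inf."""
    scores = [
        sum(a * b for a, b in zip(d_t, h)) if rid == selected else float("-inf")
        for h, rid in zip(H, record_ids)
    ]
    m = max(scores)
    exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Positions outside the selected record receive exactly zero weight, so the normalization runs only over that record's tokens.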

Following the common practice, we define the output probability with the pointer generator See et al. (2017); Wiseman et al. (2017):

    p_gen = σ(MLP([d_t; c_t]))
    p(y_t | y_{<t}, z_j) = p_gen p_vocab(y_t) + (1 − p_gen) Σ_{i: x_i = y_t} α_{t,i}

d_t is the decoder’s hidden state at time step t. [·;·] denotes vector concatenation. c_t is the context vector. MLP indicates a multi-layer perceptron with trainable weight matrices, and σ is the sigmoid function normalizing the score into (0, 1). p_gen is the probability that the word is generated from a fixed vocabulary distribution p_vocab instead of being copied. The final decoding probability is marginalized over p_vocab and the copy distribution. The generation probability of segment s_j factorizes over the words within it and the end-of-segment token:

    p(s_j | s_{<j}, z_j) = Π_t p(s_j^t | s_j^{<t}, s_{<j}, z_j) · p($ | s_j, s_{<j}, z_j)
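The final mixing of the vocabulary and copy distributions is a simple weighted sum; a minimal scalar sketch (the helper name and interface are our own illustration):

```python
def pointer_generator_prob(token, p_gen, vocab_probs, alphas, src_tokens):
    """Final probability of `token`: p_gen times its vocabulary probability,
    plus (1 - p_gen) times the total attention mass on matching source tokens."""
    copy_mass = sum(a for a, w in zip(alphas, src_tokens) if w == token)
    return p_gen * vocab_probs.get(token, 0.0) + (1.0 - p_gen) * copy_mass
```

A token can thus receive probability even when it is out of vocabulary, as long as it appears in the (selected part of the) input.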

Transition Probability

We make a mild assumption that z_j depends only on z_{j−1} and the previously generated text s_{<j}, but is irrelevant of the earlier records z_{<j−1}, which is a common practice when modelling alignment Och et al. (1999); Yu et al. (2016); Shankar and Sarawagi (2019). The transition probability is defined as:

    p(z_j = r_k | z_{j−1}, s_{<j}) ∝ exp( e_k^T (W_c c + W_d d) )    (3)

A softmax layer is finally applied to the above equation to normalize it as a proper probability distribution. e_k is a representation of record r_k, defined as a max pooling over all the word embeddings contained in r_k. c is the attention context vector when decoding the last token in s_{j−1}, defined as in Equation 1. It carries important information from z_{j−1} to help predict z_j. d is the hidden state of the neural decoder, which goes through all history tokens s_{<j}. W_c and W_d are trainable matrices to project c and d into the same dimension as e_k.

We further add one constraint to prohibit self-transition, which can be easily done by zeroing out the transition probability in Equation 3 when z_j = z_{j−1}. This forces the model to group together text describing the same data record.

Since Equation 3 conditions on all previously generated text, it is able to capture more complex dependencies than semi-markov models Liang et al. (2009); Wiseman et al. (2018).

Null Record

In our task, we find that some frequent phrases, e.g., “it is” and “and”, tend to be wrongly aligned with random records, similar to the garbage collection issue in statistical alignment Brown et al. (1993). This hurts the model’s interpretability. Therefore, we introduce an additional null record r_∅ to attract these non-content phrases. The context vector when aligned to r_∅ is a zero vector, so that the decoder decodes such words based solely on the language model without relying on the input data.


Training

Equation 2 contains exponentially many combinations to enumerate over. Here we show how to efficiently compute the likelihood with the forward algorithm in dynamic programming Rabiner (1989). We define the forward variable α(i, k) as the probability of generating the first i words y_{1:i} with the last segment ending at position i and aligned to record r_k. With the base case α(0, k_0) = 1 for a special start record k_0, the recursion goes as follows for i = 1, ..., T:

    α(i, k) = Σ_{0 ≤ j < i} Σ_{k'} α(j, k') p(z = r_k | r_{k'}, y_{1:j}) p(y_{j+1:i} | r_k)    (4)

The final likelihood of the target text can be computed as p(y|x) = Σ_k α(T, k). As the forward algorithm is fully differentiable, we maximize the log-likelihood of the target text by backpropagating through the dynamic programming. The process is essentially equivalent to the generalized EM algorithm Eisner (2016). By means of modern automatic differentiation tools, we avoid the necessity of calculating the posterior distribution manually Kim et al. (2018).

To speed up training, we set a threshold L on the maximum length of a segment as in Liang et al. (2009); Wiseman et al. (2018). This changes the complexity of Equation 4 to a constant instead of scaling linearly with the length of the target text. Moreover, as pointed out in Wang et al. (2017), the computation for the longest segment can be reused for shorter segments. We therefore first compute the generation and transition probabilities for the whole sequence in one pass. The intermediate results are then cached to efficiently proceed with the forward algorithm without any re-computation.

One last issue is numerical precision: it is important to use log-space binary operations to avoid underflow Kim et al. (2017).
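The forward recursion, computed with log-space operations as noted above, can be sketched as follows (the callable interface and all names are our own illustration; `L` is the maximum segment length threshold):

```python
import math

def logsumexp(xs):
    # log(sum(exp(x))) computed stably in log space
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(T, K, log_init, log_trans, log_seg, L):
    """alpha[i][k]: log-prob of y_{1:i} with the last segment aligned to
    record k.  log_seg(k, j, i): log-prob that record k generates the
    segment y_{j+1:i} (including the end-of-segment token);
    log_trans(kp, k, j): log transition prob given previous record kp and
    history y_{1:j}; log_init(k): log-prob of the first record choice."""
    NEG = float("-inf")
    alpha = [[NEG] * K for _ in range(T + 1)]
    for i in range(1, T + 1):
        for k in range(K):
            terms = []
            for j in range(max(0, i - L), i):  # candidate segment start
                if j == 0:
                    terms.append(log_init(k) + log_seg(k, 0, i))
                else:
                    for kp in range(K):
                        if alpha[j][kp] != NEG:
                            terms.append(alpha[j][kp]
                                         + log_trans(kp, k, j)
                                         + log_seg(k, j, i))
            if terms:
                alpha[i][k] = logsumexp(terms)
    return logsumexp(alpha[T])
```

For example, with a single record generating any segment with probability 0.5, a two-word text has two segmentations with total probability 0.5 + 0.5 · 0.5 = 0.75, which the recursion recovers. In the actual model this recursion runs over neural log-probabilities and is backpropagated through.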

Near[riverside], Food[French], EatType[pub], Name[Cotto]
1. [Near][the][riverside][is a][French]
2. [Near the riverside][is][a French][pub]
  [called Cotto][.]
3. [Near the riverside][is a French][pub]
  [called Cotto .]
4. [Near the riverside][is a French pub]
  [called Cotto .]
Table 1: Segmentation with various granularities. 1 is too fine-grained while 4 is too coarse. We expect a segmentation like 2 or 3 to better control the generation.

Segmentation Granularity

There are several valid segmentations for a given text. As shown in Table 1, when the segmentation is too fine-grained (example 1), controlling the output information becomes difficult because the content of one data record is realized in separate pieces. (The finer-grained segmentation might be useful if the focus is on modeling the detailed discourse structure instead of the information accuracy Reed et al. (2018); Balakrishnan et al. (2019), which we leave for future work.) When it is too coarse, the alignment might become less accurate (as in example 4, “pub” is wrongly merged with the previous words and aligned together with them to the “Food” record). In practice, we expect the segmentation to maintain accurate alignment yet avoid being too brokenly separated. To control the granularity as we want, we utilize posterior regularization Ganchev et al. (2010) to constrain the expected number of segments for each text, which can be calculated by going through a similar forward pass as in Equation 4 Eisner (2002); Backes et al. (2018). (We could also utilize heuristic rules to help segmentation, e.g., preventing breaking syntactic elements obtained from an external parser Yang et al. (2019) or matching entity names with handcrafted rules Chen et al. (2018). The interpretability of the segmental structure allows easy combination with such rules. We focus on a general domain-agnostic method in this paper, though heuristic rules might bring further improvement in certain cases.) Most computation is shared without significant extra burden. The final loss function is:


    L = −log p(y|x) + λ · max(|E[#segments] − η| − ε, 0)    (5)

−log p(y|x) is the negative log-likelihood of the target text after marginalizing over all valid segmentations. E[#segments] is the expected number of segments, and λ, η and ε are hyperparameters. We use the max-margin loss to encourage E[#segments] to stay close to η within a tolerance range of ε.
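The max-margin penalty on the expected segment count can be written directly; a minimal sketch, where `eta` (the target count), `eps` (the tolerance) and `lam` (the weight) are our own parameter names and the exact functional form is our hedged reading of the regularizer:

```python
def granularity_penalty(expected_segments, eta, eps):
    """Zero while |E[#segments] - eta| <= eps; grows linearly
    once the expectation leaves the tolerance band."""
    return max(abs(expected_segments - eta) - eps, 0.0)

def regularized_loss(neg_log_likelihood, expected_segments, eta, eps, lam):
    # negative log-likelihood plus the soft granularity constraint
    return neg_log_likelihood + lam * granularity_penalty(expected_segments, eta, eps)
```

Because the penalty is zero inside the band, the constraint is soft: it only pushes back when the model drifts toward overly fine or overly coarse segmentations.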


Decoding

The segment-by-segment generation process allows us to easily constrain the output structure. Undesirable patterns can be rejected before the whole text is generated. We adopt three simple constraints for the decoder:

  1. Segments must not be empty.

  2. The same data record cannot be realized more than once (except for the null record).

  3. The generation will not finish until all data records have been realized.

Constraints 2 and 3 directly address the information repetition and missing problems. As segments are incrementally generated, the constraints are checked for validity. Note that adding the constraints hardly incurs any cost; the decoding process still finishes in one pass. No post-processing or reranking is needed.
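The checks amount to cheap set operations at each segment boundary; a minimal sketch with our own helper names:

```python
def allowed_next_records(all_records, realized, null_id):
    """Constraint 2: candidate records for the next segment; every record
    may be realized at most once, except the null record."""
    return {r for r in all_records if r == null_id or r not in realized}

def may_finish(all_records, realized, null_id):
    """Constraint 3: generation may only stop once every non-null record
    has been realized."""
    return all(r == null_id or r in realized for r in all_records)
```

During beam or greedy decoding, `allowed_next_records` filters the transition distribution of Equation 3 before each new segment, and `may_finish` gates the end-of-text decision.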

Computational Complexity

Suppose the input data has N records and each record contains M tokens. The computational complexity of neural attention models is O(N·M) at each decoding step, where the whole input is retrieved. Our model, similar to chunkwise attention Chiu and Raffel (2018) or coarse-to-fine attention Ling and Rush (2017), reduces the cost to O(N) + O(M): we select the record in O(N) at the beginning of each segment and attend only to the selected record in O(M) when decoding every word. For larger input data, our model can be significantly cheaper than neural attention models.

5 Experiment Setup


We conduct experiments on the E2E Novikova et al. (2017b) and WebNLG Colin et al. (2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain. The inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to fifteen distinct DBpedia categories. The inputs are up to seven RDF triples of the form (subject, relation, object).

Implementation Details

We use a bi-directional LSTM encoder and uni-directional LSTM decoder for all experiments. Input data records are concatenated into a sequence and fed into the encoder. We choose a hidden size of 512 for E2E and 256 for WebNLG for both the encoder and decoder. The word embeddings have size 100 for both datasets and are initialized with pre-trained GloVe embeddings Pennington et al. (2014). We apply the same dropout rate to both the encoder and decoder. Models are trained using the Adam optimizer Kingma and Ba (2014) with batch size 64. The learning rate is initialized to 0.01 and decays by an order of magnitude once the validation loss increases. All hyperparameters are chosen with grid search according to the validation loss. Models are implemented with the open-source library PyTorch Paszke et al. (2019). We set the target number of segments in Eq. 5 to the number of records in the input data. The intuition is that every text is expected to realize the content of all input records, so it is natural to assume every text can be roughly segmented into as many fragments as there are records, each corresponding to one data record. A deviation within the tolerance range is allowed for noisy data or text with complex structures.


We measure the quality of system outputs from three perspectives: (1) Word-level overlap with human references, a commonly used metric for text generation. We report the scores of BLEU-4 Papineni et al. (2002), ROUGE-L Lin (2004), Meteor Banerjee and Lavie (2005) and CIDEr Vedantam et al. (2015). (2) Human evaluation. Word-level overlap scores usually correlate rather poorly with human judgements on fluency and information accuracy Reiter and Belz (2009); Novikova et al. (2017a). Therefore, we passed the input data and generated text to human annotators to judge whether the text is grammatically fluent (scale 1–5 as in Belz and Reiter (2006)), contains wrong facts inconsistent with the input data, or repeats or misses information. We report the averaged score for fluency and absolute counts for the others. The human evaluation is conducted on a sampled subset of the test data. To ensure the subset covers inputs with all possible numbers of records (3–8 for E2E and 1–7 for WebNLG), we sample 20 instances for every possible record count. Finally, we obtain 120 test cases for E2E and 140 for WebNLG. (The original human evaluation subset of WebNLG was randomly sampled and most of its inputs contain fewer than 3 records, so we opt for a new sample for a thorough evaluation.) (3) Diversity of outputs. Diversity is an important concern for many real-life applications. We measure it by the number of unique unigrams and trigrams over system outputs, as done in Dušek et al. (2020).
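Concretely, the diversity counts are the numbers of unique n-grams pooled over all system outputs; a short sketch (whitespace tokenization is our simplification):

```python
def distinct_ngrams(outputs, n):
    """Count unique n-grams across all system outputs."""
    ngrams = set()
    for text in outputs:
        toks = text.split()
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
    return len(ngrams)
```

The Dist-1 and Dist-3 columns in the result tables correspond to `n = 1` and `n = 3` computed over each system's full output set.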

6 Results

Figure 3: Average expected number of segments with varying hyperparameters. x-axis is the encoder/decoder hidden size and y-axis is the word embedding size. Upper two figures are without the granularity regularization and the bottom two are with regularization.
Metrics Word Overlap Human Evaluation Diversity
Models BLEU R-L Meteor CIDEr Fluent Wrong Repeat Miss Dist-1 Dist-3
Slug 0.662 0.677 0.445 2.262 4.94 5 0 17 74 507
DANGNT 0.599 0.663 0.435 2.078 4.97 0 0 21 61 301
TUDA 0.566 0.661 0.453 1.821 4.98 0 0 10 57 143
N_Temp 0.598 0.650 0.388 1.950 4.84 19 3 35 119 795
PG 0.638 0.677 0.449 2.123 4.91 15 1 29 133 822
Ours 0.647 0.683 0.453 2.222 4.96 0 1 15 127 870
Ours (+R) 0.645 0.681 0.452 2.218 4.95 0 0 13 133 881
Ours (+RM) 0.651 0.682 0.455 2.241 4.95 0 0 3 135 911
Table 2: Automatic and human evaluation results on E2E dataset. SLUG, DANGNT, TUDA and N_TEMP are from previous works and the other models are our own implementations.
Metrics Word Overlap Human Evaluation Diversity
Models BLEU R-L Meteor CIDEr Fluent Wrong Repeat Miss Dist-1 Dist-3
Melbourne 0.450 0.635 0.376 2.814 4.16 42 22 37 3,167 13,744
UPF-FORGe 0.385 0.609 0.390 2.500 4.08 29 6 28 3,191 12,509
PG 0.452 0.652 0.384 2.623 4.13 43 26 42 3,218 13,403
Ours 0.453 0.656 0.388 2.610 4.23 26 19 31 3,377 14,516
Ours (+R) 0.456 0.657 0.390 2.678 4.28 18 2 24 3,405 14,351
Ours (+RM) 0.461 0.654 0.398 2.639 4.26 23 4 5 3,457 14,981
Table 3: Automatic and human evaluation results on WebNLG dataset. MELBOURNE and UPF-FORGE are from previous works and the other models are our own implementations.
Egg Harbor Township, New Jersey isPartOf New Jersey Atlantic City International Airport Location Identifier “KACY” ICAO
Atlantic City International Airport location Egg Harbor Township, New Jersey Egg Harbor Township, New Jersey country United States
Egg Harbor Township, New Jersey isPartOf Atlantic County, New Jersey
PG Atlantic City International Airport is located in Egg Harbor Township , New Jersey , United States . It is located in Egg Harbor Township , New Jersey .
Ours KACY is the ICAO location identifier of Atlantic City International Airport , which is located at Egg Harbor Township , New jersey , in the United States . The ICAO location identifier of Atlantic City International Airport is KACY .
Ours (+R) KACY is the ICAO location identifier of Atlantic City International Airport , which is located at Egg Harbor Township , New jersey , in the United States .
Ours (+RM) KACY is the ICAO location identifier of Atlantic City International Airport , which is located at Egg Harbor Township , New jersey , in the United States . The Egg Harbor Township is a part of Atlantic County , New Jersey . Egg Harbor Township is a part of New Jersey .
Figure 4: Example generations from WebNLG. Relation types are underlined and repeated generations are bolded. Segments and corresponding records in our model are marked in the same color. By adding explicit constraints to the decoding process, repetition and missing issues can be largely reduced. (better viewed in color)
Input: [name the mill][eattype restaurant][food english][pricerange moderate][customerrating 1 out of 5][area riverside]
PG: the mill is a low - priced restaurant in the city centre that delivers take - away . it is located near café rouge.
Input: [name the mill][eattype restaurant][food english][pricerange moderate][customerrating 1 out of 5][area riverside] …
Ours: [the mill][restaurant][near café rouge][in riverside][serves english food][at moderate prices][. it is kid friendly and]…
Table 4: (E2E) Attention map when decoding the word “low” in the PG model and “moderate” in our model. Hallucinated contents are bolded. The PG model wrongly attended to other slots and thereby “hallucinated” the content of “low-priced”. Our model always attends to one single slot instead of averaging over the whole input, so the chance of hallucination is largely reduced.

In this section, we first show the effects of the proposed granularity regularization, then compare model performance on the two datasets and analyze the differences. Our model is compared against the neural attention-based pointer generator (PG), which does not explicitly learn the segmentation and correspondence. To show the effects of the constrained decoding described in section 4 (Decoding), we run our model with only the first constraint to prevent empty segments (denoted by Ours in experiments), with the first two constraints to prevent repetition (denoted by Ours (+R)), and with all constraints to further reduce information missing (denoted by Ours (+RM)).

Segmentation Granularity

We show the effects of the granularity regularization (section 4, Segmentation Granularity) in Fig 3. When varying the model size, the segmentation granularity changes considerably if no regularization is imposed. Intuitively, if the generation module is strong enough (larger hidden size), it can accurately estimate the sentence likelihood by itself without paying the extra cost of switching between segments, so it tends to reduce the number of transitions. Vice versa, the number of transitions grows if the transition module is stronger (larger embedding size). With the proposed regularization, the granularity remains at the desired level regardless of the hyperparameters. We can thereby freely decide the model capacity without worrying about differences in segmentation behavior.

Results on E2E

On the E2E dataset, apart from our implementations, we also compare against outputs from SLUG Juraska et al. (2018), the overall winner of the E2E challenge (seq2seq-based), DANGNT Nguyen and Tran (2018), the best grammar-rule-based model, TUDA Puzikov and Gurevych (2018), the best template-based model, and the autoregressive neural template model (N_TEMP) from Wiseman et al. (2018). SLUG uses a heuristic slot aligner based on a set of handcrafted rules and combines a complex pipeline of data augmentation, selection, model ensembling and reranking, while our model has a simple end-to-end learning paradigm with no special delexicalizing, training or decoding tricks. Table 2 reports the evaluation results. Seq2seq-based models are more diverse than rule-based models at the cost of a higher chance of making errors. As rule-based systems are by design always faithful to the input information, they made zero wrong facts in their outputs. Most models do not have the fact repetition issue because of the relatively simple patterns in the E2E dataset; therefore, adding the (+R) constraint only improves the performance slightly. The (+RM) constraint reduces the number of information-missing cases to 3 without hurting the fluency. All three missing cases are due to a wrong alignment between the period and one data record, which can be easily fixed by defining a simple rule. We put the error analysis in appendix A. N_Temp performs worst among all seq2seq-based systems because of the restrictions we mentioned in section 2. As also noted by the authors, it trades off generation quality for interpretability and controllability. In contrast, our model, despite relying on no heuristics or complex pipelines, made zero wrong facts with the lowest information-missing rate, even surpassing the rule-based models. It also remains interpretable and controllable without sacrificing generation quality.

Results on WebNLG

Table 3 reports the results evaluated on the WebNLG dataset. We also include results from MELBOURNE, a seq2seq-based system achieving the highest scores on automatic metrics in the WebNLG challenge, and UPF-FORGE, a classic grammar-based system that won the human evaluation. WebNLG contains significantly more distinct types of attributes than E2E, so the chance of making errors or repetitions increases greatly. Nevertheless, our model still performs on par on automatic metrics with superior information adequacy and output diversity. The (+R) decoding constraint becomes important since the outputs in WebNLG are much longer than those in E2E; neural network models have problems tracking the generation history beyond a certain range and might repeat facts that were already generated long before. The (+R) constraint effectively reduces the repetition cases from 19 to 2. These 2 cases are intra-segment repetitions and fail to be detected since our model can only track inter-segment constraints (examples are in appendix A). The (+RM) constraint brings the information-missing cases down to 5, with slightly more wrong and repeated facts compared with (+R). Forcing models to keep generating until covering all records inevitably increases the risk of making errors.


In summary, our model generates the most diverse outputs and achieves similar or better performance on word-overlap automatic metrics, while significantly reducing the information hallucination, repetition and missing problems. An example of hallucination is shown in Table 4. The standard PG model “hallucinated” the contents of “low-priced”, “in the city centre” and “delivers take-away”. The visualized attention maps reveal that it failed to attend properly when decoding the word “low”. The decoding is driven mostly by the language model instead of the contents of the input data. In contrast, as we explicitly align each segment to one slot, the attention distribution of our model is concentrated on one single slot rather than averaged over the whole input, so the chance of hallucination is largely reduced.

Figure 4 shows some example generations from WebNLG. Without the decoding constraints, the PG model and our model both suffer from information repetition and missing. However, the interpretability of our model enables us to easily avoid these issues by constraining the segment-transition behavior. For the attention-based PG model, there exists no simple way of applying these constraints. We can also explicitly control the output structure similarly to Wiseman et al. (2018); examples are shown in appendix B.

7 Conclusion

In this work, we exploit the segmental structure in data-to-text generation. The proposed model significantly alleviates the information hallucination, repetition and missing problems without sacrificing the fluency and diversity. It is end-to-end trainable, domain-independent and allows explicit control over the structure of generated text. As our model is interpretable in the correspondence between segments and input records, it can be easily combined with hand-engineered heuristics or user-specific requirements to further improve the performance.

8 Acknowledgements

This research was funded in part by the DFG collaborative research center SFB 1102. Ernie Chang is supported by SFB 248 “Foundations of Perspicuous Software Systems” (E2); Xiaoyu Shen is supported by IMPRS-CS fellowship. We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.

References

  • Angeli et al. (2010) Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.
  • Backes et al. (2018) Michael Backes, Pascal Berrang, Mathias Humbert, Xiaoyu Shen, and Verena Wolf. 2018. Simulating the large-scale erosion of genomic privacy over time. IEEE/ACM transactions on computational biology and bioinformatics, 15(5):1405–1412.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
  • Balakrishnan et al. (2019) Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani, Michael White, and Rajen Subba. 2019. Constrained decoding for neural NLG from compositional representations in task-oriented dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 831–844, Florence, Italy. Association for Computational Linguistics.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Barzilay and Lapata (2005) Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338. Association for Computational Linguistics.
  • Belz (2008) Anja Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(4):431–455.
  • Belz and Reiter (2006) Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of nlg systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
  • Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263–311.
  • Chang et al. (2020) Ernie Chang, David Ifeoluwa Adelani, Xiaoyu Shen, and Vera Demberg. 2020. Unsupervised pidgin text generation by pivoting english data and self-training. arXiv preprint arXiv:2003.08272.
  • Chen et al. (2018) Mingje Chen, Gerasimos Lampouras, and Andreas Vlachos. 2018. Sheffield at e2e: structured prediction approaches to end-to-end language generation. arXiv preprint.
  • Chiu and Raffel (2018) Chung-Cheng Chiu and Colin Raffel. 2018. Monotonic chunkwise attention. ICLR.
  • Colin et al. (2016) Emilie Colin, Claire Gardent, Yassine M’rabet, Shashi Narayan, and Laura Perez-Beltrachini. 2016. The WebNLG challenge: Generating text from DBPedia data. In Proceedings of the 9th International Natural Language Generation conference, pages 163–167, Edinburgh, UK. Association for Computational Linguistics.
  • Daumé III and Marcu (2005) Hal Daumé III and Daniel Marcu. 2005. Induction of word and phrase alignments for automatic document summarization. Computational Linguistics, 31(4):505–530.
  • Deng et al. (2017) Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. 2017. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 980–989. JMLR.org.
  • Deng et al. (2018) Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pages 9712–9724.
  • Dušek et al. (2018) Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the e2e nlg challenge. In Proceedings of the 11th International Conference on Natural Language Generation, pages 322–328.
  • Dušek et al. (2020) Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The e2e nlg challenge. Computer Speech & Language, 59:123–156.
  • Eisner (2002) Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1–8.
  • Eisner (2016) Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP, pages 1–17.
  • Ferreira et al. (2019) Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. EMNLP.
  • Ganchev et al. (2010) Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049.
  • Huang et al. (2018) Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. 2018. Towards neural phrase-based machine translation. ICLR.
  • Juraska et al. (2018) Juraj Juraska, Panagiotis Karagiannis, Kevin Bowden, and Marilyn Walker. 2018. A deep ensemble model with slot alignment for sequence-to-sequence natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 152–162.
  • Kawakami et al. (2019) Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2019. Unsupervised word discovery with segmental neural language models. ACL.
  • Kim and Mooney (2010) Joohyun Kim and Raymond J Mooney. 2010. Generative alignment and semantic parsing for learning from ambiguous supervision. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 543–551. Association for Computational Linguistics.
  • Kim et al. (2017) Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. ICLR.
  • Kim et al. (2018) Yoon Kim, Sam Wiseman, and Alexander M Rush. 2018. A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR.
  • Koncel-Kedziorski et al. (2014) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, and Ali Farhadi. 2014. Multi-resolution language grounding with weak supervision. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 386–396.
  • Kondadadi et al. (2013) Ravi Kondadadi, Blake Howald, and Frank Schilder. 2013. A statistical nlg framework for aggregated planning and realization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1406–1415.
  • Konstas and Lapata (2013) Ioannis Konstas and Mirella Lapata. 2013. A global model for concept-to-text generation. Journal of Artificial Intelligence Research, 48:305–346.
  • Kukich (1983) Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of the 21st annual meeting on Association for Computational Linguistics, pages 145–150. Association for Computational Linguistics.
  • Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213. Association for Computational Linguistics.
  • Liang et al. (2009) Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 91–99. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Ling and Rush (2017) Jeffrey Ling and Alexander Rush. 2017. Coarse-to-fine attention models for document summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 33–42.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
  • McKeown (1992) Kathleen McKeown. 1992. Text generation. Cambridge University Press.
  • Mei et al. (2016) Hongyuan Mei, TTI UChicago, Mohit Bansal, and Matthew R Walter. 2016. What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In Proceedings of NAACL-HLT, pages 720–730.
  • Moryossef et al. (2019) Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277.
  • Nguyen and Tran (2018) Dang Tuan Nguyen and Trung Tran. 2018. Structure-based generation system for e2e nlg challenge. arXiv preprint.
  • Novikova et al. (2017a) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for nlg. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.
  • Novikova et al. (2017b) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.
  • Och et al. (1999) Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
  • Oya et al. (2014) Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and Raymond Ng. 2014. A template-based abstractive meeting summarization: Leveraging summary and source text relationships. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 45–53.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6908–6915.
  • Puzikov and Gurevych (2018) Yevgeniy Puzikov and Iryna Gurevych. 2018. E2e nlg challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation, pages 463–471.
  • Qin et al. (2018) Guanghui Qin, Jin-Ge Yao, Xuening Wang, Jinpeng Wang, and Chin-Yew Lin. 2018. Learning latent semantic annotations for grounding natural language to structured data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3761–3771.
  • Rabiner (1989) Lawrence R Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
  • Reed et al. (2018) Lena Reed, Shereen Oraby, and Marilyn Walker. 2018. Can neural generators for dialogue learn sentence planning and discourse structuring? In Proceedings of the 11th International Conference on Natural Language Generation, pages 284–295.
  • Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  • Reiter and Dale (1997) Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.
  • Shankar and Sarawagi (2019) Shiv Shankar and Sunita Sarawagi. 2019. Posterior attention models for sequence to sequence learning. ICLR.
  • Shen et al. (2017) Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, and Dietrich Klakow. 2017. Estimation of gap between current language models and human performance. Proc. Interspeech 2017, pages 553–557.
  • Shen et al. (2019a) Xiaoyu Shen, Jun Suzuki, Kentaro Inui, Hui Su, Dietrich Klakow, and Satoshi Sekine. 2019a. Select and attend: Towards controllable content selection in text generation. arXiv preprint arXiv:1909.04453.
  • Shen et al. (2019b) Xiaoyu Shen, Yang Zhao, Hui Su, and Dietrich Klakow. 2019b. Improving latent alignment in text summarization by generalizing the pointer generator. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3753–3764.
  • Su et al. (2018) Hui Su, Xiaoyu Shen, Wenjie Li, and Dietrich Klakow. 2018. Nexus network: Connecting the preceding and the following in dialogue generation. arXiv preprint arXiv:1810.00671.
  • Su et al. (2019) Hui Su, Xiaoyu Shen, Rongzhi Zhang, Fei Sun, Pengwei Hu, Cheng Niu, and Jie Zhou. 2019. Improving multi-turn dialogue modelling with utterance rewriter. arXiv preprint arXiv:1906.07004.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Wang et al. (2017) Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3674–3683. JMLR.org.
  • Wang et al. (2018) Weiyue Wang, Derui Zhu, Tamer Alkhouli, Zixuan Gan, and Hermann Ney. 2018. Neural hidden Markov model for machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 377–382, Melbourne, Australia. Association for Computational Linguistics.
  • Wiseman et al. (2017) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.
  • Wiseman et al. (2018) Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2018. Learning neural templates for text generation. EMNLP.
  • Wu et al. (2018) Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438.
  • Yang et al. (2019) Ze Yang, Wei Wu, Jian Yang, Can Xu, and Zhoujun Li. 2019. Low-resource response generation with template prior. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1886–1897, Hong Kong, China. Association for Computational Linguistics.
  • Yu et al. (2016) Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1307–1316.
  • Zhao et al. (2018) Yang Zhao, Xiaoyu Shen, Hajime Senuma, and Akiko Aizawa. 2018. A comprehensive study: Sentence compression with linguistic knowledge-enhanced gated neural network. Data & Knowledge Engineering, 117:307–318.

Appendix A Error Analysis

We analyze common errors below.

Missing: Even with the coverage decoding constraint, the model can still occasionally miss information. We show one example in Table 5. The segments cover all input records, but the segment aligned to "familyfriendly" generates only a period symbol. This happens 3 times on E2E and twice on WebNLG. In the other 3 missing cases on WebNLG, some segments generate only an end-of-sentence symbol. Both conditions can be easily fixed by simple filtering rules.
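The filtering rule mentioned above can be sketched as a check that a segment actually realizes some content: a segment consisting only of punctuation or an end-of-sentence symbol is flagged, so its slot can be re-generated or the output rejected. This is an illustrative sketch; the function name and the `<eos>` token convention are assumptions.

```python
import string

def is_degenerate(segment_tokens, eos="<eos>"):
    """Return True if a segment carries no content.

    A segment realizing a slot with only punctuation (e.g. a lone
    period) or only an end-of-sentence symbol conveys no information
    about that slot, which is exactly the failure mode observed in
    the missing-information errors.
    """
    content = [t for t in segment_tokens
               if t != eos and not all(c in string.punctuation for c in t)]
    return len(content) == 0
```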

Repeating: There are still some repeating cases on the WebNLG dataset. Table 6 shows one example: "amsterdam-centrum is part of amsterdam" is repeated twice within a single segment. As our constrained decoding can only prevent inter-segment repetition, it cannot fully avoid repetition caused by intra-segment errors of the RNN.

Input: name the phoenix eattype pub food french pricerange £20 - 25 customerrating high area riverside
familyfriendly yes near crowne plaza hotel
Output: the phoenix pub is located in riverside near crowne plaza hotel . it serves french food in the £20 -
25 price range . it has a high customer rating .
Table 5: Example of missing in E2E. The "familyfriendly" slot is aligned to the period symbol.
Input: Amsterdam ground AFC Ajax (amateurs) Eberhard van der Laan leader Amsterdam Amsterdam-Centrum part Amsterdam
Output: amsterdam-centrum is part of amsterdam and amsterdam-centrum is part of amsterdam , the country where eberhard van
der laan is the leader and the ground of afc ajax ( amateurs ) is located.
Table 6: Example of repetition in WebNLG. The phrase "amsterdam-centrum is part of amsterdam" is repeated twice.

Appendix B Controlling output structure

As our model learns an interpretable correspondence between each segment and the input records, it can control the output structure in the same way as Wiseman et al. (2018). Table 7 shows example generations obtained by sampling diverse segment structures.

Input: name the phoenix eattype pub food french pricerange £20 - 25 customerrating high area riverside
familyfriendly yes near crowne plaza hotel
Output1: the phoenix pub is located in riverside near crowne plaza hotel . it serves french food in the £20 -
25 price range . it has a high customer rating .
Output2: the phoenix is located in riverside near crowne plaza hotel . it is a family - friendly french pub with
the price range of £20 - 25 . it has a high customer rating .
Output3: located in riverside near crowne plaza hotel , the phoenix is a french pub with a high customer rating
and a price range of £20 - 25. It is family - friendly .
Table 7: Example of generations with diverse structures.
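The structure control illustrated in Table 7 can be sketched as sampling distinct slot orderings and decoding one output per ordering, with each segment forced to realize the next slot in the chosen order. This is a minimal sketch under assumed names; `n` must not exceed the number of distinct orderings, or the loop below never terminates.

```python
import random

def sample_slot_orders(slots, n=3, seed=0):
    """Sample n distinct orderings of the input slots.

    Each ordering acts as a template: decoding with segment
    transitions constrained to follow it yields one structurally
    distinct surface realization, as in the Table 7 examples.
    """
    rng = random.Random(seed)  # fixed seed for reproducible templates
    orders = set()
    while len(orders) < n:
        orders.add(tuple(rng.sample(slots, len(slots))))
    return [list(order) for order in orders]
```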