Implementation of the paper -> https://arxiv.org/abs/1709.00155. For converting information present in the form of structured data into natural language text
Generating texts from structured data (e.g., a table) is important for various natural language processing tasks such as question answering and dialog systems. In recent studies, researchers use neural language models and encoder-decoder frameworks for table-to-text generation. However, these neural network-based approaches do not model the order of contents during text generation. When a human writes a summary based on a given table, he or she would probably consider the content order before wording. In a biography, for example, the nationality of a person is typically mentioned before the occupation. In this paper, we propose an order-planning text generation model to capture the relationship between different fields and use such relationship to make the generated text more fluent and smooth. We conducted experiments on the WikiBio dataset and achieved significantly higher performance than previous methods in terms of BLEU, ROUGE, and NIST scores.
Generating texts from structured data (e.g., a table) is important for various natural language processing tasks such as question answering and dialog systems. Table 1 shows an example of a Wikipedia infobox (containing fields and values) and a text summary.
In early years, text generation was usually accomplished by human-designed rules and templates [Green2006, Turner, Sripada, and Reiter2010], and hence the generated texts were not flexible. Recently, researchers have applied neural networks to generate texts from structured data [Lebret, Grangier, and Auli2016], where a neural encoder captures table information and a recurrent neural network (RNN) decodes this information into a natural language sentence.
Although such a neural network-based approach is capable of capturing complicated language and can be trained in an end-to-end fashion, it lacks explicit modeling of content order during text generation. That is to say, an RNN generates a word at each time step conditioned on previously generated words as well as table information, which is more or less “shortsighted” and differs from how a human writer works. As suggested in the book The Elements of Style,
A basic structural design underlies every kind of writing … in most cases, planning must be a deliberate prelude to writing. [William and White1999]
This motivates order planning for neural text generation. In other words, a neural network should model not only word order (as has been well captured by RNN) but also the order of contents, i.e., fields in a table.
We also observe from real summaries that table fields by themselves provide illuminating clues and constraints of text generation. In the biography domain, for example, the nationality of a person is typically mentioned before the occupation. This could benefit from explicit planning of content order during neural text generation.
In this paper, we propose an order-planning method for table-to-text generation. Our model is built upon the encoder-decoder framework and uses an RNN for text synthesis with attention to table entries. Different from existing neural models, we design a table field linking mechanism, inspired by temporal memory linkage in the Differentiable Neural Computer [Graves et al.2016, DNC]. Our field linking mechanism explicitly models the relationship between different fields, enabling our neural network to better plan what to say first and what next. Further, we incorporate a copy mechanism [Gu et al.2016] into our model to cope with rare words.
We evaluated our method on the WikiBio dataset [Lebret, Grangier, and Auli2016]. Experimental results show that our order-planning approach significantly outperforms previous state-of-the-art results in terms of BLEU, ROUGE, and NIST metrics. Extensive ablation tests verify the effectiveness of each component in our model; we also perform visualization analysis to better understand the proposed order-planning mechanism.
Our model takes as input a table (e.g., a Wikipedia infobox) and generates a natural language summary describing the information based on an RNN. The neural network contains three main components:
An encoder captures table information;
A dispatcher—a hybrid content- and linkage-based attention mechanism over table contents—plans what to generate next; and
A decoder generates a natural language summary using RNN, where we also incorporate a copy mechanism [Gu et al.2016] to cope with rare words.
We elaborate on these components in the rest of this section.
We design a neural encoder to represent table information. As shown in Figure 1, the content of each field is split into separate words and the entire table is transformed into a large sequence. Then we use a recurrent neural network (RNN) with long short-term memory (LSTM) units [Hochreiter and Schmidhuber1997] to read the contents as well as their corresponding field names.
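Concretely, let $C$ be the number of content words in a table; let $c_i$ and $f_i$ be the embeddings of a content word and its corresponding field, respectively ($i = 1, \dots, C$). The input of the LSTM-RNN is the concatenation of $f_i$ and $c_i$, denoted as $x_i = [f_i; c_i]$, and the output, denoted as $h_i$, is the encoded information corresponding to a content word, i.e.,
$$[g_{\text{in}}; g_{\text{forget}}; g_{\text{out}}] = \sigma(W_g x_i + U_g h_{i-1}) \tag{1}$$
$$\tilde{x}_i = \tanh(W_x x_i + U_x h_{i-1}) \tag{2}$$
$$\tilde{h}_i = g_{\text{in}} \circ \tilde{x}_i + g_{\text{forget}} \circ \tilde{h}_{i-1} \tag{3}$$
$$h_i = g_{\text{out}} \circ \tanh(\tilde{h}_i) \tag{4}$$
where $\circ$ denotes the element-wise product, and $\sigma$ denotes the sigmoid function. The $W$’s and $U$’s are weights, and bias terms are omitted in the equations for clarity. $g_{\text{in}}$, $g_{\text{forget}}$, and $g_{\text{out}}$ are known as the input, forget, and output gates.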
Note that we have two separate embedding matrices for fields and content words. We observe that the field names of different data samples mostly come from a fixed set of candidates, which is reasonable in a particular domain. Therefore, we assign a single embedding to each field, regardless of the number of words in the field name. For example, the field Notable work in Table 1 is represented by a single field embedding instead of the embeddings of notable and work.
For content words, we represent them with conventional word embeddings (which are randomly initialized), and use the LSTM-RNN to integrate information. In a table, some fields contain a sequence of words (e.g., Name = “Arthur Ignatius Conan Doyle”), whereas other fields contain a set of words (e.g., Occupation = “writer, physician”). We do not apply much human engineering here, but let an RNN capture such subtlety by itself.
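The flattening step described for the encoder can be illustrated with a small sketch. It is not the authors' preprocessing code: the function name `flatten_table`, the regular-expression tokenizer, and the lowercasing are assumptions made only for illustration. Each content word is paired with the name of the field it came from, so that a field embedding and a word embedding could later be looked up and concatenated.

```python
import re

def flatten_table(table):
    """Turn a list of (field_name, content_string) pairs into two parallel
    lists: content words and the field each word belongs to.

    Hypothetical helper; the tokenizer below is an assumption.
    """
    words, fields = [], []
    for field, content in table:
        # Split into word tokens and single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", content.lower()):
            words.append(token)
            fields.append(field)
    return words, fields

infobox = [
    ("Name", "Arthur Ignatius Conan Doyle"),
    ("Occupation", "writer, physician"),
]
words, fields = flatten_table(infobox)
for w, f in zip(words, fields):
    print(f, w)
```

A multi-word field name still contributes one field label per content word, which matches the idea of a single embedding per field.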
[Figure 1: (a) Encoder: table representation. (b) Dispatcher: planning what to generate next.]
After encoding table information, we use another RNN to decode a natural language summary (deferred to the next part). During the decoding process, the RNN is augmented with a dispatcher that plans what to generate next.
Generally, a dispatcher is an attention mechanism over table contents. At each decoding time step $t$, the dispatcher computes a probabilistic distribution $\alpha_{t,i}$ ($i = 1, \dots, C$), which is further used for weighting the content representations $h_i$. In our model, the dispatcher is a hybrid of content- and link-based attention, discussed in detail as follows.
Traditionally, the computation of attention is based on the content representation as well as some state during decoding [Bahdanau, Cho, and Bengio2015, Mei, Bansal, and Walter2016]. We call this content-based attention, which is also one component in our dispatcher.
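Since both the field name and the content contain important clues for text generation, we compute the attention weights based on not only the encoded vector of table content $h_i$ but also the field embedding $f_i$, thus obtaining the final attention by re-weighting one with the other. Formally, we have
$$\tilde{\alpha}^{(f)}_{t,i} = f_i^\top \big(W^{(f)} y_{t-1} + b^{(f)}\big) \tag{5}$$
$$\tilde{\alpha}^{(c)}_{t,i} = h_i^\top \big(W^{(c)} y_{t-1} + b^{(c)}\big) \tag{6}$$
$$\alpha^{\text{content}}_{t,i} = \frac{\exp\big\{\tilde{\alpha}^{(f)}_{t,i}\, \tilde{\alpha}^{(c)}_{t,i}\big\}}{\sum_{j=1}^{C} \exp\big\{\tilde{\alpha}^{(f)}_{t,j}\, \tilde{\alpha}^{(c)}_{t,j}\big\}} \tag{7}$$
where $W^{(f)}$, $b^{(f)}$, $W^{(c)}$, and $b^{(c)}$ are learnable parameters; $f_i$ and $h_i$ are vector representations of the field name and encoded content, respectively, for the $i$th row; and $y_{t-1}$ is the embedding of the word generated in the last step. $\alpha^{\text{content}}_{t,i}$ is the content-based attention weight. Ideally, a larger content-based attention indicates a more relevant content to the last generated word.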
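As a rough illustration of the re-weighting idea, the sketch below scores every table entry twice, once through its field embedding and once through its encoder state, multiplies the two scores, and normalizes with a softmax. The exact scoring form (a linear map of the previous word embedding plus a bias) and all names (`content_attention`, `W_f`, `b_f`, `W_c`, `b_c`) are assumptions for illustration, written in plain Python rather than a deep learning framework.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def content_attention(fields, hiddens, y_prev, W_f, b_f, W_c, b_c):
    """fields[i]: field embedding of entry i; hiddens[i]: encoder state of
    entry i; y_prev: embedding of the previously generated word."""
    q_f = [a + b for a, b in zip(matvec(W_f, y_prev), b_f)]  # query for fields
    q_c = [a + b for a, b in zip(matvec(W_c, y_prev), b_c)]  # query for contents
    field_scores = [dot(f, q_f) for f in fields]
    content_scores = [dot(h, q_c) for h in hiddens]
    # Re-weight one score with the other, then normalize over entries.
    return softmax([a * b for a, b in zip(field_scores, content_scores)])

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
attn = content_attention(
    fields=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    hiddens=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    y_prev=[1.0, 2.0],
    W_f=identity, b_f=zeros, W_c=identity, b_c=zeros,
)
print(attn)
```

Because the two scores are multiplied before the softmax, an entry needs both a matching field and matching content to receive high attention.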
We further propose a link-based attention mechanism that directly models the relationship between different fields.
Our intuition stems from the observation that a well-organized text typically has a reasonable order of its contents. As illustrated previously, the nationality of a person is often mentioned before the occupation (e.g., a British writer). Therefore, we propose a link-based attention to explicitly model such order information.
We construct a link matrix $\mathcal{L} \in \mathbb{R}^{n_f \times n_f}$, where $n_f$ is the number of possible field names in the dataset. An element $\mathcal{L}[\phi_j, \phi_i]$ is a real-valued score indicating how likely the field $\phi_i$ is mentioned after the field $\phi_j$. (Here, $[\cdot, \cdot]$ indexes a matrix.) The link matrix $\mathcal{L}$ is a part of the model parameters and is learned by backpropagation. Although the link matrix appears to be large in size (1475 × 1475), a large number of its elements are not used because most pairs of fields never co-occur in any data sample; in total, we have 53,422 effective parameters here. In other scenarios, a low-rank approximation may be used to reduce the number of parameters.
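Formally, let $\alpha_{t-1,i}$ ($i = 1, \dots, C$) be the attention probability over table contents in the last time step during generation. (Here, $\alpha_{t-1,i}$ refers to the hybrid content- and link-based attention, which will be introduced shortly.) For a particular data sample whose content words are of fields $\phi_1, \phi_2, \dots, \phi_C$, we first weight the linking scores by the previous attention probability, and then normalize the weighted score to obtain the link-based attention probability, given by
$$\alpha^{\text{link}}_{t,i} = \operatorname{softmax}\Big\{\sum_{j=1}^{C} \alpha_{t-1,j} \cdot \mathcal{L}[\phi_j, \phi_i]\Big\} \tag{8}$$
$$= \frac{\exp\big\{\sum_{j=1}^{C} \alpha_{t-1,j} \cdot \mathcal{L}[\phi_j, \phi_i]\big\}}{\sum_{i'=1}^{C} \exp\big\{\sum_{j=1}^{C} \alpha_{t-1,j} \cdot \mathcal{L}[\phi_j, \phi_{i'}]\big\}} \tag{9}$$
where $\mathcal{L}[\phi_j, \phi_i]$ is the link-matrix score for field $\phi_i$ following field $\phi_j$.
Intuitively, the link matrix is analogous to the transition matrix in a Markov chain [Karlin2014], whereas the term $\sum_{j} \alpha_{t-1,j} \cdot \mathcal{L}[\phi_j, \phi_i]$ is similar to one step of transition in the Markov chain. However, in our scenario, a table in a particular data sample contains only a few fields, but a field may occur several times because it contains more than one content word. Therefore, we do not require each row of our link matrix to be a probabilistic distribution, but normalize the probability afterwards by Equation 9, which turns out to work well empirically.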
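The link-based step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: fields are represented by integer ids, `link[a][b]` is taken to be the score for field `b` following field `a`, and the toy scores below are made up rather than learned.

```python
import math

def link_attention(prev_attn, field_ids, link):
    """prev_attn[j]: attention on entry j at the previous step.
    field_ids[j]: integer id of the field that entry j belongs to.
    link[a][b]: score for field b being mentioned after field a (assumed
    orientation)."""
    n = len(field_ids)
    scores = []
    for i in range(n):
        # One "transition step": previous attention weighted by link scores.
        s = sum(prev_attn[j] * link[field_ids[j]][field_ids[i]] for j in range(n))
        scores.append(s)
    # Normalize afterwards with a softmax over table entries.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy fields: 0 = name, 1 = nationality, 2 = occupation.
field_ids = [0, 0, 1, 2]          # two name words, one nationality, one occupation
link = [
    [0.0, 2.0, 0.0],              # after name, nationality is favored
    [0.0, 0.0, 3.0],              # after nationality, occupation is favored
    [0.0, 0.0, 0.0],
]
step1 = link_attention([0.5, 0.5, 0.0, 0.0], field_ids, link)
step2 = link_attention([0.0, 0.0, 1.0, 0.0], field_ids, link)
print(step1)
print(step2)
```

With attention previously on the name words, the largest weight moves to the nationality entry; with attention on nationality, it moves to the occupation entry. Rows of `link` are not normalized, in line with the remark that normalization happens after the weighting.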
Besides, we would like to point out that the link-based attention is inspired by the Differentiable Neural Computer [Graves et al.2016, DNC]. DNC contains a “linkage-based addressing” mechanism to track consecutively used memory slots and thus to integrate order information during memory addressing. Likewise, we design the link-based attention to capture the temporal order of different fields. But different from the linking strength heuristically defined in DNC, the link matrix in our model is directly parameterized and trained in an end-to-end manner.
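To combine the above two attention mechanisms, we use a self-adaptive gate $\tilde{z}_t \in (0, 1)$ computed by a sigmoid unit
$$\tilde{z}_t = \sigma\big(w^\top [h'_{t-1}; e^{(f)}_t; y_{t-1}]\big) \tag{10}$$
where $w$ is a parameter vector. $h'_{t-1}$ is the last step’s hidden state of the decoder RNN. $y_{t-1}$ is the embedding of the word generated in the last step; $e^{(f)}_t$ is the sum of field embeddings $f_i$ weighted by the current step’s field attention $\alpha^{\text{link}}_{t,i}$. As $y_{t-1}$ and $e^{(f)}_t$ emphasize the content and link aspects, respectively, the self-adaptive gate $\tilde{z}_t$ is aware of both. In practice, we find that $\tilde{z}_t$ tends to address link-based attention too much and thus adjust it empirically to obtain the final gate $z_t$.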
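Finally, the hybrid attention, a probabilistic distribution over all content words, is given by
$$\alpha^{\text{hybrid}}_t = z_t \cdot \alpha^{\text{content}}_t + (1 - z_t) \cdot \alpha^{\text{link}}_t \tag{11}$$
where $z_t$ is the gate value at time step $t$.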
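A minimal sketch of the gating step follows, assuming the gate multiplies the content-based attention and its complement multiplies the link-based attention. The function names and the feature layout are hypothetical, and the empirical adjustment of the gate mentioned above is not reproduced because its exact form is not given here.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gate(w, dec_state, field_context, y_prev):
    """Scalar gate from the previous decoder state, the attention-weighted
    field embedding, and the previous word embedding (concatenated)."""
    features = list(dec_state) + list(field_context) + list(y_prev)
    return sigmoid(sum(a * b for a, b in zip(w, features)))

def hybrid_attention(content_attn, link_attn, z):
    """Convex combination of two distributions; the result sums to one."""
    return [z * c + (1.0 - z) * l for c, l in zip(content_attn, link_attn)]

z = gate(w=[0.3, -0.1, 0.2], dec_state=[1.0], field_context=[0.5], y_prev=[2.0])
mixed = hybrid_attention([0.8, 0.2], [0.2, 0.8], z)
print(z, mixed)
```

Since both inputs are probability distributions and the gate lies between 0 and 1, the mixture is again a distribution, so it can be used directly as attention weights.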
The decoder is an LSTM-RNN that predicts target words in sequence. We also have an attention mechanism [Bahdanau, Cho, and Bengio2015] that summarizes source information, i.e., the table in our scenario, by weighted sum, yielding an attention vector $a_t$ by
$$a_t = \sum_{i=1}^{C} \alpha^{\text{hybrid}}_{t,i}\, h_i \tag{12}$$
where $h_i$ is the hidden representation obtained by the table encoder. As $\alpha^{\text{hybrid}}_{t,i}$ is a probabilistic distribution over content words, determined by both content and link information, it enables the decoder RNN to focus on relevant information at a time, serving as an order-planning mechanism for table-to-text generation.
Then we concatenate the attention vector $a_t$ and the embedding of the last step’s generated word $y_{t-1}$, and use a single-layer neural network to mix information before feeding to the decoder RNN. In other words, the decoder RNN’s input (denoted as $x_t$) is
$$x_t = \tanh\big(W_d [a_t; y_{t-1}] + b_d\big) \tag{13}$$
where $W_d$ and $b_d$ are weights. Similar to Equations 1–4, at a time step $t$ during decoding, the decoder RNN yields a hidden representation $h'_t$, based on which a score function $s^{\text{LSTM}}_t$ is computed suggesting the next word to generate. The score function is computed by
$$s^{\text{LSTM}}_t = W_s h'_t + b_s \tag{14}$$
where $h'_t$ is the decoder RNN’s state. ($W_s$ and $b_s$ are weights.) The score function can be thought of as the input of a softmax layer for classification before being normalized to a probabilistic distribution. We incorporate a copy mechanism [Gu et al.2016] into our approach, and the normalization is accomplished after considering a copying score, introduced as follows.
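Table 2 (excerpt): BLEU, ROUGE, and NIST scores of our models.
|Group||Model||BLEU||ROUGE||NIST|
|Our results||Content attention only||41.38||34.65||8.57|
|Our results||Order planning (full model)||43.91||37.15||8.85|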
The copy mechanism scores a content word $w_i$ by its hidden representation $h_i$ in the encoder side, indicating how likely the content word is directly copied during target generation. That is,
$$s^{\text{copy}}_{t,i} = \sigma\big(h_i^\top W_c\big)\, h'_t \tag{15}$$
and $s^{\text{copy}}_{t,i}$ is a real number for $i = 1, \dots, C$ (the number of content words). Here $W_c$ is a parameter matrix, and $h'_t$ is the decoding state.
In other words, when a word $w$ appears in the table content, it has a copying score computed as above. If $w$ occurs multiple times in the table contents, the scores are added, given by
$$s^{\text{copy}}_t(w) = \sum_{i=1}^{C} s^{\text{copy}}_{t,i} \cdot \mathbb{1}\{w_i = w\} \tag{16}$$
where $\mathbb{1}\{w_i = w\}$ is a Boolean variable indicating whether the content word $w_i$ is the same as the word $w$ we are considering.
Finally, the LSTM score and the copy score are added for a particular word and further normalized to obtain a probabilistic distribution, given by
$$s_t(w) = s^{\text{LSTM}}_t(w) + s^{\text{copy}}_t(w) \tag{17}$$
$$p(w \mid \cdot) = \operatorname{softmax}\big(s_t(w)\big) = \frac{\exp\{s_t(w)\}}{\sum_{w' \in \mathcal{V} \cup \mathcal{C}} \exp\{s_t(w')\}} \tag{18}$$
where $\mathcal{V}$ refers to the vocabulary list and $\mathcal{C}$ refers to the set of content words in a particular data sample. In this way, the copy mechanism can either generate a word from the vocabulary or directly copy a word from the source side. This is helpful in our scenario because some fields in a table (e.g., Name) may contain rare or unseen words and the copy mechanism can cope with them naturally.
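The way generation and copying scores are merged can be sketched as follows. This is a simplified illustration, not the authors' code: scores are stored in dictionaries keyed by word, and a table word that is outside the vocabulary is assumed to have a generation score of zero, so that only its copying score counts.

```python
import math
from collections import defaultdict

def output_distribution(vocab_scores, content_words, copy_scores):
    """vocab_scores: dict mapping each vocabulary word to its generation score.
    content_words: the table's content words (one per entry).
    copy_scores: one copying score per table entry, aligned with content_words.
    Returns a dict mapping every word in vocabulary-or-table to a probability."""
    total = defaultdict(float)
    for word, score in vocab_scores.items():
        total[word] += score
    for word, score in zip(content_words, copy_scores):
        total[word] += score  # repeated table words accumulate their scores
    # Softmax over the union of vocabulary words and table words.
    m = max(total.values())
    exps = {word: math.exp(score - m) for word, score in total.items()}
    norm = sum(exps.values())
    return {word: e / norm for word, e in exps.items()}

probs = output_distribution(
    vocab_scores={"was": 1.0, "a": 0.5, "writer": 0.2},
    content_words=["doyle", "writer", "doyle"],
    copy_scores=[1.5, 0.3, 1.0],
)
best = max(probs, key=probs.get)
print(best, probs[best])
```

In this toy case the out-of-vocabulary name "doyle" occurs twice in the table, its two copying scores add up, and it becomes the most probable next word even though the vocabulary alone could not produce it.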
For simplicity, we use greedy search during inference, i.e., for each time step $t$, the word with the largest probability is chosen, given by $y_t = \arg\max_{w} p(w \mid \cdot)$. The decoding process terminates when a special symbol eos is generated, indicating the end of a sequence.
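Our training objective is the negative log-likelihood of a sentence $y_1 \cdots y_T$ in the training set,
$$J = -\sum_{t=1}^{T} \log p(y_t \mid y_0 \cdots y_{t-1}) \tag{19}$$
where $p(y_t \mid \cdot)$ is computed by Equation 18. An $\ell_2$ penalty is also added, as in most other studies.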
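As a small worked example of the objective, the sketch below sums the negative log-probabilities that a model assigns to the reference words of one sentence and optionally adds a weight penalty. The function name is hypothetical, and treating the penalty as a squared-weight (L2) term with coefficient `l2` is an assumption.

```python
import math

def sentence_nll(step_probs, params=(), l2=0.0):
    """step_probs[t]: probability the model gives to the reference word at
    step t. params: flat iterable of weights for the optional penalty."""
    nll = -sum(math.log(p) for p in step_probs)
    penalty = l2 * sum(w * w for w in params)
    return nll + penalty

# Two-word sentence whose reference words receive probabilities 0.5 and 0.25.
loss = sentence_nll([0.5, 0.25])
print(loss)
```

Here the loss equals log 2 + log 4, i.e., three times log 2; assigning higher probability to the reference words lowers it.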
Since all the components described above are differentiable, our entire model can be trained end-to-end by backpropagation, and we use Adam [Kingma and Ba2015] for optimization.
We used the newly published WikiBio dataset [Lebret, Grangier, and Auli2016] (https://github.com/DavidGrangier/wikipedia-biography-dataset), which contains 728,321 biographies from WikiProject Biography (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Biography; originally from English Wikipedia, September 2015).
Each data sample comprises an infobox table of field-content pairs, being the input of our system. The generation target is the first sentence in the biography, which follows the setting in previous work [Lebret, Grangier, and Auli2016]. Although only the first sentence is considered in the experiment, the sentence typically serves as a summary of the article. In fact, the target sentence has 26.1 tokens on average, which is actually long. Also, the sentence contains information spanning multiple fields, and hence our order-planning mechanism is useful in this scenario.
We applied the standard data split: 80% for training and 10% for testing, except that model selection was performed on a validation subset of 1000 samples (based on BLEU-4).
We decapitalized all words and kept a vocabulary size of 20,000 for content words and generation candidates, which also followed previous work [Lebret, Grangier, and Auli2016]. Even with this reasonably large vocabulary size, we had more than 900k out-of-vocabulary words. This justifies the use of the copy mechanism.
For the names of table fields, we treated each as a special token. By removing nonsensical fields whose content is “none” and grouping fields occurring fewer than 100 times as an “Unknown” field, we had 1475 different field names in total.
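The field preprocessing described above can be sketched as follows. The helper names are hypothetical, and counting one occurrence per field-content pair is an assumption, since the counting unit is not specified here.

```python
from collections import Counter

def build_field_vocab(tables, min_count=100, unk="Unknown"):
    """tables: iterable of tables, each a list of (field, content) pairs.
    Fields whose content is "none" are skipped; fields seen fewer than
    min_count times are left out and later map to the unk id."""
    counts = Counter()
    for table in tables:
        for field, content in table:
            if content.strip().lower() != "none":
                counts[field] += 1
    vocab = {unk: 0}
    for field in sorted(f for f, c in counts.items() if c >= min_count):
        vocab[field] = len(vocab)
    return vocab

def field_id(vocab, field, unk="Unknown"):
    """Look up a field id, falling back to the shared unk id."""
    return vocab.get(field, vocab[unk])

tables = [
    [("name", "a b"), ("occupation", "writer")],
    [("name", "c"), ("spouse", "none")],
    [("name", "d"), ("occupation", "poet"), ("website", "x")],
]
vocab = build_field_vocab(tables, min_count=2)  # small threshold for the toy data
print(vocab, field_id(vocab, "website"))
```

With the threshold lowered to 2 for this toy input, "name" and "occupation" keep their own ids, while "website" and "spouse" fall back to the shared "Unknown" id.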
In our experiments, both words’ and table fields’ embeddings were 400-dimensional and LSTM layers were 500-dimensional. Note that a field (e.g., “name”) and a content/generation word (e.g., also “name”), even with the same string, were considered as different tokens; hence, they had different embeddings. We randomly initialized all embeddings, which were tuned during training.
We compared our model with previous results using either traditional language models or neural networks.
Table NLM: Lebret, Grangier, and Auli (2016) propose an RNN-based model with attention and copy mechanisms. They have several model variants, and we quote the highest reported results.
We report model performance in terms of several metrics, namely BLEU-4, ROUGE-4, and NIST-4, which are computed by standard software, NIST mteval-v13a.pl (for BLEU and NIST) and MSR rouge-1.5.5 (for ROUGE). We did not include the perplexity measure used by Lebret, Grangier, and Auli (2016) because the copy mechanism makes the vocabulary size vary among data samples, and thus the perplexity is not comparable among different approaches.
Table 2 compares the overall performance with previous work. We see that modern neural networks are considerably better than traditional Kneser-Ney (KN) models with or without templates. Moreover, our base model (with content attention only) outperforms that of Lebret, Grangier, and Auli (2016), which reflects our engineering efforts. After adding all proposed components, we obtain an improvement of about 2.5 points in both BLEU and ROUGE and about 0.3 in NIST, achieving new state-of-the-art results.
Table 3 provides an extensive ablation test to verify the effectiveness of each component in our model. The top half of the table shows the results without the copy mechanism, and the bottom half incorporates the copying score as described previously. We observe that the copy mechanism is consistently effective with different types of attention.
We then compare content-based attention and link-based attention, as well as their hybrid (also Table 3). The results show that link-based attention alone is not as effective as content-based attention. However, we achieve better performance when combining them with an adaptive gate, i.e., the proposed hybrid attention. The results are consistent in both halves of Table 3 (with or without copying) and in terms of all metrics (BLEU, ROUGE, and NIST). This implies that content-based attention and link-based attention do capture different aspects of information, and their hybrid is more suited to the task of table-to-text generation.
We are further interested in the effect of the gate $z$, which balances content-based attention $\alpha^{\text{content}}$ and link-based attention $\alpha^{\text{link}}$. As defined in Equations 10–11, the computation of $z$ depends on the decoding state as well as table information; hence it is “self-adaptive.” We would like to verify whether such adaptiveness is useful. To this end, we designed a controlled experiment where the gate $z$ was manually assigned in advance and fixed during training. In other words, the setting was essentially a (fixed) interpolation between $\alpha^{\text{content}}$ and $\alpha^{\text{link}}$. Specifically, we tuned $z$ from 0 to 1 at a fixed granularity, and plot BLEU scores as the comparison metric in Figure 3.
As seen, interpolation of content- and link-based attention is generally better than either single mechanism, which again shows the effectiveness of hybrid attention. However, the peak performance of simple interpolation (42.89 BLEU) is worse than that of the self-adaptive gate, implying that our gating mechanism can automatically adjust the importance of $\alpha^{\text{content}}$ and $\alpha^{\text{link}}$ at a particular time based on the current state and input.
We are curious whether the proposed order-planning mechanism is better than other possible ways of using field information. We conducted two controlled experiments as follows. Similar to the proposed approach, we multiplied the attention probability $\alpha_t$ by a field embedding matrix and thus obtained a weighted field embedding. We fed it to either (1) the computation of content-based attention, i.e., Equations 5–6, or (2) the RNN decoder’s input, i.e., Equation 13. In both cases, the last step’s weighted field embedding was concatenated with the embedding of the generated word $y_{t-1}$.
From Table 4, we see that feeding field information to the computation of $\alpha^{\text{content}}$ interferes with content attention and leads to performance degradation, and that feeding it to the decoder RNN slightly improves model performance. However, both controlled experiments are worse than the proposed method. The results confirm that our order-planning mechanism is indeed useful in modeling the order of fields, outperforming several other approaches that use the same field information in a naïve fashion.
|Feeding field info to…||BLEU||ROUGE||NIST|
|Decoder RNN’s input||41.96||35.07||8.61|
|Hybrid att. (proposed)||43.91||37.15||8.85|
We showcase an example in Table 5. With only content-based attention, the network is confused about when the word American is appropriate in the sentence, and corrupts the phrase former governor of the federal reserve system as appears in the reference. However, when link-based attention is added, the network is more aware of the order between fields “Nationality” and “Occupation,” and generates the nationality American before the occupation economist. This process could also be visualized in Figure 4. Here, we plot our model’s content-based attention, link-based attention and their hybrid. (The content- and link-based attention probabilities may be different from those separately trained in the ablation test.) After generating “emmett john rice ( december 21, 1919 – march 10, 2011 ) was,” content-based attention skips the nationality and focuses more on the occupation. Link-based attention, on the other hand, provides a strong clue suggesting to generate the nationality first and then occupation. In this way, the obtained sentence is more compliant with conventions.
Text generation has long aroused interest in the NLP community due to its wide applications including automated navigation [Dale, Geldof, and Prost2003] and weather forecasting [Reiter et al.2005]. Traditionally, text generation can be divided into several steps [Stent, Prassad, and Walker2004]: (1) content planning defines what information should be conveyed in the generated sentence; (2) sentence planning determines what to generate in each sentence; and (3) surface realization actually generates those sentences with words.
In early years, surface realization was often accomplished by templates [Van Deemter, Theune, and Krahmer2005] or statistically learned (shallow) models, e.g., probabilistic context-free grammar [Belz2008] and language models [Angeli, Liang, and Klein2010], with hand-crafted features or rules. Therefore, these methods are weak in terms of the quality of generated sentences. For planning, researchers also apply (shallow) machine learning approaches. collective (collective), for example, model it as a collective classification problem, whereas semimarkov (semimarkov) use a generative semi-Markov model to align text segments and assigned meanings. Generally, planning and realization in the above work are separate and have difficulty in capturing the complexity of language due to the nature of shallow models.
Recently, the recurrent neural network (RNN) has been playing a key role in natural language generation. As an RNN can automatically capture highly complicated patterns during end-to-end training, it has successful applications including machine translation [Bahdanau, Cho, and Bengio2015], dialog systems [Shang, Lu, and Li2015], and text summarization [Tan, Wan, and Xiao2017].
Researchers have then begun to use RNNs for text generation from structured data. Mei, Bansal, and Walter (2016) propose a coarse-to-fine grained attention mechanism that selects one or more records (e.g., a piece of weather forecast) by a precomputed but fixed probability and then dynamically attends to relevant contents during decoding. Lebret, Grangier, and Auli (2016) incorporate the copy mechanism [Gu et al.2016] into the generation process. However, the above approaches do not explicitly model the order of contents. It is also nontrivial to combine traditional planning techniques with such end-to-end learned RNNs.
Our paper proposes an order-planning approach by designing a hybrid of content- and link-based attention. The model is inspired by hybrid content- and location-based addressing in the Differentiable Neural Computer [Graves et al.2016, DNC], where the location-based addressing is defined heuristically. Instead, we propose a transition-like link matrix that models how likely a field is mentioned after another, which is more suited to our scenario.
Moreover, our entire model is differentiable, and thus the planning and realization steps in traditional language generation can be learned end-to-end in our model.
In this paper, we propose an order-planning neural network that generates texts from a table (Wikipedia infobox). The text generation process is built upon an RNN with attention to table contents. Different from traditional content-based attention, we explicitly model the order of contents by a link matrix, based on which we compute a link-based attention. Then a self-adaptive gate balances the content- and link-based attention mechanisms. We further incorporate a copy mechanism into our model to cope with rare or unseen words.
We evaluated our approach on a newly proposed large-scale dataset, WikiBio. Experimental results show that we outperform previous results by a large margin in terms of BLEU, ROUGE, and NIST scores. We also conducted extensive ablation tests showing the effectiveness of the copy mechanism, as well as the hybrid attention of content and linking information. We compared our order-planning mechanism with other possible ways of modeling fields; the results confirm that the proposed method is better than feeding field embeddings to the network in a naïve fashion. Finally, we provide a case study and visualize the attention scores so as to better understand our model.
In future work, we would like to deal with text generation from multiple tables. In particular, we would design hierarchical attention mechanisms that can first select a table containing the information and then select a field for generation, which would improve the attention efficiency. We would also like to apply the proposed method to text generation from other structured data, e.g., a knowledge graph.
We thank Jing He from AdeptMind.ai for helpful discussions on different ways of using field information.
Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, volume 2, 690–696.