Bootstrapping Generators from Noisy Data

04/17/2018 ∙ by Laura Perez-Beltrachini, et al.

A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention.




1 Introduction

A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and paired texts Barzilay and Lapata (2005); Kim and Mooney (2010); Liang et al. (2009). These correspondences describe how data representations are expressed in natural language (content realisation) but also indicate which subset of the data is verbalised in the text (content selection).

Although content selection is traditionally performed by domain experts, recent advances in generation using neural networks Bahdanau et al. (2015); Ranzato et al. (2016) have led to the use of large scale datasets containing loosely related data and text pairs. Prime examples are online data sources like DBPedia Auer et al. (2007) and Wikipedia, whose data and associated texts are often independently edited. Another example is sports databases and related textual resources. Wiseman et al. (2017) recently defined a generation task relating statistics of basketball games with commentaries and a blog written by fans.

In this paper, we focus on short text generation from such loosely aligned data-text resources. We work with the biographical subset of the DBPedia and Wikipedia resources where the data corresponds to DBPedia facts and texts are Wikipedia abstracts about people. Figure 1 shows an example for the film-maker Robert Flaherty, the Wikipedia infobox, and the corresponding abstract. We wish to bootstrap a data-to-text generator that learns to verbalise properties about an entity from a loosely related example text. Given the set of properties in Figure (1a) and the related text in Figure (1b), we want to learn verbalisations for those properties that are mentioned in the text and produce a short description like the one in Figure (1c).

(a) [Infobox with property-value pairs for Robert Flaherty]
(b) Robert Joseph Flaherty, (February 16, 1884 – July 23, 1951) was an American film-maker who directed and produced the first commercially successful feature-length documentary film, Nanook of the North (1922). The film made his reputation and nothing in his later life fully equalled its success, although he continued the development of this new genre of narrative documentary, e.g., with Moana (1926), set in the South Seas, and Man of Aran (1934), filmed in Ireland’s Aran Islands. He is considered the “father” of both the documentary and the ethnographic film. Flaherty was married to writer Frances H. Flaherty from 1914 until his death in 1951. Frances worked on several of her husband’s films, and received an Academy Award nomination for Best Original Story for Louisiana Story (1948).
(c) Robert Joseph Flaherty, (February 16, 1884 – July 23, 1951) was an American film-maker. Flaherty was married to Frances H. Flaherty until his death in 1951.
Figure 1: Property-value pairs (a), related biographic abstract (b) for the Wikipedia entity Robert Flaherty, and model verbalisation in italics (c).

In common with previous work Mei et al. (2016); Lebret et al. (2016); Wiseman et al. (2017) our model draws on insights from neural machine translation Bahdanau et al. (2015); Sutskever et al. (2014) using an encoder-decoder architecture as its backbone. Lebret et al. (2016) introduce the task of generating biographies from Wikipedia data; however, they focus on single sentence generation. We generalize the task to multi-sentence text, and highlight the limitations of the standard attention mechanism which is often used as a proxy for content selection. When exposed to sub-sequences that do not correspond to any facts in the input, the soft attention mechanism will still try to justify the sequence and somehow distribute the attention weights over the input representation Ghader and Monz (2017). The decoder will still memorise high frequency sub-sequences in spite of these not being supported by any facts in the input.

We propose to alleviate these shortcomings via a specific content selection mechanism based on multi-instance learning (MIL; Keeler and Rumelhart, 1992) which automatically discovers correspondences, namely alignments, between data and text pairs. These alignments are then used to modify the generation function during training. We experiment with two frameworks that allow us to incorporate alignment information, namely multi-task learning (MTL; Caruana, 1993) and reinforcement learning (RL; Williams, 1992). In both cases we define novel objective functions using the learnt alignments. Experimental results using automatic and human-based evaluation show that models trained with content-specific objectives improve upon vanilla encoder-decoder architectures which rely solely on soft attention.

The remainder of this paper is organised as follows. We discuss related work in Section 2 and describe the MIL-based content selection approach in Section 3. We explain how the generator is trained in Section 4 and present evaluation experiments in Section 5. Section 7 concludes the paper.

2 Related Work

Previous attempts to exploit loosely aligned data and text corpora have mostly focused on extracting verbalisation spans for data units. Most approaches work in two stages: initially, data units are aligned with sentences from related corpora using some heuristics and subsequently extra content is discarded in order to retain only text spans verbalising the data. Belz and Kow (2010) obtain verbalisation spans using a measure of strength of association between data units and words, Walter et al. (2013) extract textual patterns from paths in dependency trees, while Mrabet et al. (2016) rely on crowd-sourcing. Perez-Beltrachini and Gardent (2016) learn shared representations for data units and sentences reduced to subject-predicate-object triples with the aim of extracting verbalisations for knowledge base properties. Our work goes a step further: we not only induce data-to-text alignments but also learn generators that produce short texts verbalising a set of facts.

Our work is closest to recent neural network models which learn generators from independently edited data and text resources. Most previous work Lebret et al. (2016); Chisholm et al. (2017); Sha et al. (2017); Liu et al. (2017) targets the generation of single sentence biographies from Wikipedia infoboxes, while Wiseman et al. (2017) generate game summary documents from a database of basketball games where the input is always the same set of table fields. In contrast, in our scenario, the input data varies from one entity (e.g., athlete) to another (e.g., scientist) and properties might be present or not due to data incompleteness. Moreover, our generator is enhanced with a content selection mechanism based on multi-instance learning. MIL-based techniques have been previously applied to a variety of problems including image retrieval Maron and Ratan (1998); Zhang et al. (2002), object detection Carbonetto et al. (2008); Cour et al. (2011), text classification Andrews and Hofmann (2004), image captioning Wu et al. (2015); Karpathy and Fei-Fei (2015), paraphrase detection Xu et al. (2014), and information extraction Hoffmann et al. (2011). The application of MIL to content selection is novel to our knowledge.

We show how to incorporate content selection into encoder-decoder architectures following training regimes based on multi-task learning and reinforcement learning. Multi-task learning aims to improve a main task by incorporating joint learning of one or more related auxiliary tasks. It has been applied with success to a variety of sequence-prediction tasks focusing mostly on morphosyntax. Examples include chunking, tagging Collobert et al. (2011); Søgaard and Goldberg (2016); Bjerva et al. (2016); Plank (2016), name error detection Cheng et al. (2015), and machine translation Luong et al. (2016). Reinforcement learning Williams (1992) has also seen popularity as a means of training neural networks to directly optimize a task-specific metric Ranzato et al. (2016) or to inject task-specific knowledge Zhang and Lapata (2017). We are not aware of any work that compares the two training methods directly. Furthermore, our reinforcement learning-based algorithm differs from previous text generation approaches Ranzato et al. (2016); Zhang and Lapata (2017) in that it is applied to documents rather than individual sentences.

3 Bidirectional Content Selection

We consider loosely coupled data and text pairs where the data component is a set $\mathcal{P}$ of property-values $\{p_1 : v_1, \dots, p_{|\mathcal{P}|} : v_{|\mathcal{P}|}\}$ and the related text $\mathcal{T}$ is a sequence of sentences $(s_1, \dots, s_{|\mathcal{T}|})$. We define a mention span as a (possibly discontinuous) subsequence of $\mathcal{T}$ containing one or several words that verbalise one or more property-values from $\mathcal{P}$. For instance, in Figure 1, the mention span “married to Frances H. Flaherty” verbalises the property-value for Flaherty’s spouse.

In traditional supervised data to text generation tasks, data units (e.g., $p_i : v_i$ in our particular setting) are either covered by some mention span or do not have any mention span at all in the text $\mathcal{T}$. The latter is a case of content selection where the generator will learn which properties to ignore when generating text from such data. In this work, we consider text components which are independently edited, and will unavoidably contain unaligned spans, i.e., text segments which do not correspond to any property-value in the property set $\mathcal{P}$. The phrase “from 1914” in the text in Figure (1b) is such an example. Similarly, the last sentence talks about Frances’ awards and nominations, and this information is not supported by the properties either.

Our model checks content in both directions; it identifies which properties have a corresponding text span (data selection) and also foregrounds (un)aligned text spans (text selection). This knowledge is then used to discourage the generator from producing text not supported by facts in the property set $\mathcal{P}$. We view a property set $\mathcal{P}$ and its loosely coupled text $\mathcal{T}$ as a coarse level, imperfect alignment. From this alignment signal, we want to discover a set of finer grained alignments indicating which mention spans in $\mathcal{T}$ align to which properties in $\mathcal{P}$. For each pair $(\mathcal{P}, \mathcal{T})$, we learn an alignment set $\mathcal{A}(\mathcal{P}, \mathcal{T})$ which contains property-value word pairs. For example, for individual properties in Figure 1, we would like to derive alignments such as those in Table 1.

Table 1: Example of word-property alignments for the Wikipedia abstract and facts in Figure 1.

We formulate the task of discovering finer-grained word alignments as a multi-instance learning problem Keeler and Rumelhart (1992). We assume that words from the text are positive labels for some property-values but we do not know which ones. For each data-text pair $(\mathcal{P}, \mathcal{T})$, we derive $|\mathcal{T}|$ pairs of the form $(\mathcal{P}, s)$ where $|\mathcal{T}|$ is the number of sentences in $\mathcal{T}$. We encode property sets $\mathcal{P}$ and sentences $s$ into a common multi-modal $h$-dimensional embedding space. While doing this, we discover finer grained alignments between words and property-values. The intuition is that by learning a high similarity score for a property set and sentence pair $(\mathcal{P}, s)$, we will also learn the contribution of individual elements (i.e., words and property-values) to the overall similarity score. We will then use this individual contribution as a measure of word and property-value alignment. More concretely, we assume a word and property-value pair is aligned (or unaligned) if this individual score is above (or below) a given threshold. Across examples like the one shown in Figure (1a-b), we expect the model to learn an alignment between the text span “married to Frances H. Flaherty” and the property-value for the spouse.

In what follows we describe how we encode pairs and define the similarity function.

Property Set Encoder

As there is no fixed order among the property-value pairs $p : v$ in $\mathcal{P}$, we individually encode each one of them. Furthermore, both properties and values may consist of short phrases, as is the case for some of the pairs in Figure 1. We therefore consider property-value pairs as concatenated sequences $p\,v$ and use a bidirectional Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) network for their encoding. Note that the same network is used for all pairs. Each property-value pair is encoded into a vector representation:

$$\mathbf{p}_i = \mathrm{biLSTM}(p_i\, v_i) \qquad (1)$$

which is the output of the recurrent network at the final time step. We use addition to combine the forward and backward outputs and generate the encoding $\{\mathbf{p}_1, \dots, \mathbf{p}_{|\mathcal{P}|}\}$ for $\mathcal{P}$.

Sentence Encoder

We also use a biLSTM to obtain a representation for the sentence $s = w_1, \dots, w_{|s|}$. Each word $w_t$ is represented by the concatenation of the forward and backward outputs of the networks at time step $t$:

$$\mathbf{w}_t = \mathrm{biLSTM}(w_t) \qquad (2)$$

and each sentence is encoded as a sequence of vectors $(\mathbf{w}_1, \dots, \mathbf{w}_{|s|})$.

Alignment Objective

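Our learning objective seeks to maximise the similarity score between property set $\mathcal{P}$ and a sentence $s$ Karpathy and Fei-Fei (2015). This similarity score is in turn defined on top of the similarity scores among property-values in $\mathcal{P}$ and words in $s$. Equation (3) defines this similarity function using the dot product. The function seeks to align each word to the best scoring property-value:

$$S_{\mathcal{P}s} = \sum_{t=1}^{|s|} \max_{i \in \{1, \dots, |\mathcal{P}|\}} \mathbf{p}_i \cdot \mathbf{w}_t \qquad (3)$$

Equation (4) defines our objective which encourages related properties $\mathcal{P}$ and sentences $s$ to have higher similarity than other $\mathcal{P}' \neq \mathcal{P}$ and $s' \neq s$:

$$\mathcal{L}_{CA} = \max(0,\, S_{\mathcal{P}s'} - S_{\mathcal{P}s} + 1) + \max(0,\, S_{\mathcal{P}'s} - S_{\mathcal{P}s} + 1) \qquad (4)$$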
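To make the alignment objective concrete, the following is a minimal sketch of the word-to-best-property similarity and a hinge-style ranking loss over one mismatched sentence and one mismatched property set. The function names, the plain-list vector representation and the margin of 1 are assumptions made for illustration; the encoders that would produce the vectors are not shown.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def similarity(prop_vecs, word_vecs):
    """Sum, over words, of the score of the best-matching property-value."""
    return sum(max(dot(p, w) for p in prop_vecs) for w in word_vecs)


def ranking_loss(prop_vecs, word_vecs, other_prop_vecs, other_word_vecs, margin=1.0):
    """Hinge loss pushing the related pair above two mismatched pairs.

    The margin value is an assumption of this sketch.
    """
    s_pos = similarity(prop_vecs, word_vecs)
    # Same property set paired with an unrelated sentence.
    loss_sentence = max(0.0, similarity(prop_vecs, other_word_vecs) - s_pos + margin)
    # Unrelated property set paired with the same sentence.
    loss_props = max(0.0, similarity(other_prop_vecs, word_vecs) - s_pos + margin)
    return loss_sentence + loss_props
```

The loss is zero once the related pair outscores both mismatched pairs by at least the margin, so only violating pairs contribute to the gradient.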


4 Generator Training

In this section we describe the base generation architecture and explain two alternative ways of using the alignments to guide the training of the model. One approach follows multi-task training where the generator learns to output a sequence of words but also to predict alignment labels for each word. The second approach relies on reinforcement learning for adjusting the probability distribution of word sequences learnt by a standard word prediction training algorithm.

4.1 Encoder-Decoder Base Generator

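We follow a standard attention based encoder-decoder architecture for our generator Bahdanau et al. (2015); Luong et al. (2015). Given a set of properties $X$ as input, the model learns to predict an output word sequence $Y$ which is a verbalisation of (part of) the input. More precisely, the generation of sequence $Y$ is conditioned on input $X$:

$$P(Y \mid X) = \prod_{t=1}^{|Y|} P(y_t \mid y_{1:t-1}, X) \qquad (5)$$

The encoder module constitutes an intermediate representation of the input. For this, we use the property-set encoder described in Section 3 which outputs vector representations $\{\mathbf{p}_1, \dots, \mathbf{p}_{|X|}\}$ for a set of property-value pairs. The decoder uses an LSTM and a soft attention mechanism Luong et al. (2015) to generate one word $y_t$ at a time conditioned on the previous output words and a context vector $\mathbf{c}_t$ dynamically created:

$$P(y_{t+1} \mid y_{1:t}, X) = \mathrm{softmax}(g(\mathbf{h}_t, \mathbf{c}_t)) \qquad (6)$$

where $g(\cdot)$ is a neural network with one hidden layer parametrised by $\mathbf{W}_o \in \mathbb{R}^{|V| \times d}$, $|V|$ is the output vocabulary size and $d$ the hidden unit dimension, over $\mathbf{h}_t$ and $\mathbf{c}_t$ composed as follows:

$$g(\mathbf{h}_t, \mathbf{c}_t) = \mathbf{W}_o \tanh(\mathbf{W}_c [\mathbf{c}_t ; \mathbf{h}_t]) \qquad (7)$$

where $\mathbf{W}_c \in \mathbb{R}^{d \times 2d}$. $\mathbf{h}_t$ is the hidden state of the LSTM decoder which summarises $y_{1:t}$:

$$\mathbf{h}_t = \mathrm{LSTM}(y_t, \mathbf{h}_{t-1}) \qquad (8)$$

The dynamic context vector $\mathbf{c}_t$ is the weighted sum of the hidden states of the input property set (Equation (9)); and the weights $\alpha_{ti}$ are determined by a dot product attention mechanism:

$$\mathbf{c}_t = \sum_{i=1}^{|X|} \alpha_{ti}\, \mathbf{p}_i \qquad (9)$$

$$\alpha_{ti} = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}_i)}{\sum_{i'} \exp(\mathbf{h}_t \cdot \mathbf{p}_{i'})} \qquad (10)$$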
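The dot product attention step can be illustrated with a small sketch. It assumes the decoder state and the encoded property-value pairs are plain lists of floats of the same length; `attend` and `softmax` are hypothetical helper names, not part of any released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, prop_vecs):
    """Return attention weights and the context vector for one decoding step."""
    # Dot product score between the decoder state and each encoded property.
    scores = [sum(a * b for a, b in zip(decoder_state, p)) for p in prop_vecs]
    weights = softmax(scores)
    dim = len(prop_vecs[0])
    # Context vector: attention-weighted sum of the property encodings.
    context = [sum(w * p[k] for w, p in zip(weights, prop_vecs)) for k in range(dim)]
    return weights, context
```

Because the weights always sum to one, some property receives attention mass at every step, even when the word being generated has no supporting fact, which is the limitation of soft attention that the paper discusses.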


We initialise the decoder with the averaged sum of the encoded input representations Vinyals et al. (2016). The model is trained to minimise the negative log-likelihood:

$$\mathcal{L}_{wNLL} = -\log \prod_{t=1}^{|Y|} P(y_t \mid y_{1:t-1}, X) \qquad (11)$$
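As a small illustration of these two steps, the sketch below averages encoded property vectors to obtain an initial decoder state and computes the negative log-likelihood of a target sequence from per-step output distributions. The names and the list-based representation are assumptions of the sketch.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def sequence_nll(step_distributions, target_ids):
    """Negative log-likelihood of the target words.

    step_distributions[t] is the predicted distribution over the vocabulary
    at step t and target_ids[t] is the index of the reference word.
    """
    return -sum(math.log(dist[y]) for dist, y in zip(step_distributions, target_ids))
```

Summing log probabilities rather than multiplying probabilities is the usual way to evaluate the product form of the likelihood without numerical underflow.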


We extend this architecture to multi-sentence texts in a way similar to Wiseman et al. (2017). We view the abstract as a single sequence, i.e., all sentences are concatenated. When training, we cut the abstracts into blocks of equal size and perform forward-backward iterations for each block (this includes the back-propagation through the encoder). From one block iteration to the next, we initialise the decoder with the last state of the previous block. The block size is a hyperparameter tuned experimentally on the development set.
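The block-wise training loop can be sketched as follows. The sketch assumes the final block may be shorter than the others (no padding) and uses a hypothetical `step_fn` callback standing in for one forward-backward pass that returns a loss and the decoder's last state.

```python
def split_into_blocks(tokens, block_size):
    """Cut a token sequence into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def run_blocks(blocks, step_fn, initial_state):
    """Process blocks in order, carrying the decoder state across blocks.

    step_fn(block, state) is a hypothetical callback that performs one
    forward-backward iteration and returns (loss, last_decoder_state).
    """
    state = initial_state
    losses = []
    for block in blocks:
        loss, state = step_fn(block, state)
        losses.append(loss)
    return losses, state
```

Carrying the state forward is what lets later blocks condition on what has already been generated, even though gradients are only propagated within a block.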

4.2 Predicting Alignment Labels

The generation of the output sequence is conditioned on the previous words and the input. However, when certain sequences are very common, the language modelling conditional probability will prevail over the input conditioning. For instance, the phrase “from 1914” in our running example is very common in contexts that talk about periods of marriage or club membership, and as a result, the language model will output this phrase often, even in cases where there are no supporting facts in the input. The intuition behind multi-task training Caruana (1993) is that it will smooth the probabilities of frequent sequences when trying to simultaneously predict alignment labels.

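Using the set of alignments obtained by our content selection model, we associate each word in the training data with a binary label $a_t$ indicating whether it aligns with some property in the input set. Our auxiliary task is to predict $a_t$ given the sequence of previously predicted words and input $X$:

$$P(a_{t+1} \mid y_{1:t}, X) = \mathrm{sigmoid}(g'(\mathbf{h}_t, \mathbf{c}_t)) \qquad (12)$$

$$g'(\mathbf{h}_t, \mathbf{c}_t) = \mathbf{v}_a \cdot \tanh(\mathbf{W}_c [\mathbf{c}_t ; \mathbf{h}_t]) \qquad (13)$$

where $\mathbf{v}_a \in \mathbb{R}^{d}$ and the other operands are as defined in Equation (7). We optimise the following auxiliary objective function:

$$\mathcal{L}_{aln} = -\log \prod_{t=1}^{|Y|} P(a_t \mid y_{1:t-1}, X) \qquad (14)$$

and the combined multi-task objective is the weighted sum of both word prediction and alignment prediction losses:

$$\mathcal{L}_{MTL} = \lambda\, \mathcal{L}_{wNLL} + (1 - \lambda)\, \mathcal{L}_{aln} \qquad (15)$$

where $\lambda$ controls how much model training will focus on each task. As we will explain in Section 5, we can anneal this value during training in favour of one objective or the other.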
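A minimal sketch of such a weighted multi-task loss is given below. It assumes the weighting takes the convex form `lam * word_loss + (1 - lam) * alignment_loss`, which is one common way to realise a weighted sum; the function names and the single weight `lam` are illustrative choices.

```python
import math


def word_nll(word_probs):
    """Word prediction loss; word_probs[t] is the probability of the target word."""
    return -sum(math.log(p) for p in word_probs)


def alignment_nll(align_probs, labels):
    """Binary cross-entropy; align_probs[t] is the predicted P(label = 1)."""
    return -sum(math.log(p if a == 1 else 1.0 - p) for p, a in zip(align_probs, labels))


def mtl_loss(word_probs, align_probs, labels, lam):
    """Weighted sum of both losses; the convex weighting is an assumption."""
    return lam * word_nll(word_probs) + (1.0 - lam) * alignment_nll(align_probs, labels)
```

Under this form, a small `lam` makes training focus on alignment prediction and a large `lam` on word prediction, so annealing amounts to changing `lam` between epochs.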

4.3 Reinforcement Learning Training

Although the multi-task approach aims to smooth the target distribution, the training process is still driven by the imperfect target text. In other words, at each time step $t$ the algorithm feeds the previous word $y_{t-1}$ of the target text and evaluates the prediction against the target $y_t$.

Alternatively, we propose a training approach based on reinforcement learning (Williams 1992) which allows us to define an objective function that does not fully rely on the target text but rather on a revised version of it. In our case, the set of alignments obtained by our content selection model provides a revision for the target text. The advantages of reinforcement learning are twofold: (a) it allows us to exploit additional task-specific knowledge Zhang and Lapata (2017) during training, and (b) it enables the exploration of other word sequences through sampling. Our setting differs from previous applications of RL Ranzato et al. (2016); Zhang and Lapata (2017) in that the reward function is not computed on the target text but rather on its alignments with the input.

The encoder-decoder model is viewed as an agent whose action space is defined by the set of words in the target vocabulary. At each time step, the encoder-decoder takes action $\hat{y}_t$ with policy $P_{RL}(\hat{y}_t \mid \hat{y}_{1:t-1}, X)$ defined by the probability in Equation (6). The agent terminates when it emits the End Of Sequence (EOS) token, at which point the sequence of all actions taken yields the output sequence $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_{|\hat{Y}|})$. This sequence in our task is a short text describing the properties of a given entity. After producing the sequence of actions $\hat{Y}$, the agent receives a reward $r(\hat{Y})$ and the policy is updated according to this reward.

Reward Function

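We define the reward function $r(\hat{Y})$ on the alignment set $\mathcal{A}(X, Y)$. If the output action sequence $\hat{Y}$ is precise with respect to the set of alignments $\mathcal{A}(X, Y)$, the agent will receive a high reward. Concretely, we define $r(\hat{Y})$ as follows:

$$r(\hat{Y}) = \gamma^{pr}\, r^{pr}(\hat{Y}) \qquad (16)$$

where $\gamma^{pr}$ adjusts the reward value $r^{pr}$ which is the unigram precision of the predicted sequence $\hat{Y}$ and the set of words in $\mathcal{A}(X, Y)$.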
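The precision-based reward can be sketched as below. The sketch assumes the aligned words are available as a set of tokens and treats precision as the fraction of predicted tokens found in that set; `gamma` stands for the scaling factor and its default of 1.0 is an arbitrary placeholder.

```python
def unigram_precision(predicted_tokens, aligned_words):
    """Fraction of predicted tokens that appear among the aligned words."""
    if not predicted_tokens:
        return 0.0
    hits = sum(1 for tok in predicted_tokens if tok in aligned_words)
    return hits / len(predicted_tokens)


def reward(predicted_tokens, aligned_words, gamma=1.0):
    """Scaled unigram precision; the default gamma is a placeholder."""
    return gamma * unigram_precision(predicted_tokens, aligned_words)
```

For the running example, a prediction that includes the unsupported phrase “from 1914” scores lower than one that omits it, since those two tokens have no aligned counterpart.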

Training Algorithm

We use the REINFORCE algorithm Williams (1992) to learn an agent that maximises the reward function. As this is a gradient descent method, the training loss of a sequence is defined as the negative expected reward:

$$\mathcal{L}_{RL} = -\mathbb{E}_{\hat{Y} \sim P_{RL}(\cdot \mid X)}\left[ r(\hat{Y}) \right] \qquad (17)$$

where $P_{RL}$ is the agent’s policy, i.e., the word distribution produced by the encoder-decoder model (Equation (6)) and $r(\cdot)$ is the reward function as defined in Equation (16). The gradient of $\mathcal{L}_{RL}$ is given by:

$$\nabla \mathcal{L}_{RL} \approx -\sum_{t=1}^{|\hat{Y}|} \nabla \log P_{RL}(\hat{y}_t \mid \hat{y}_{1:t-1}, X)\, \left[ r(\hat{Y}) - b_t \right] \qquad (18)$$

where $b_t$ is a baseline linear regression model used to reduce the variance of the gradients during training. $b_t$ predicts the future reward and is trained by minimizing mean squared error. The input to this predictor is the agent hidden state $\mathbf{h}_t$; however, we do not back-propagate the error to $\mathbf{h}_t$. We refer the interested reader to Williams (1992) and Ranzato et al. (2016) for more details.
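The following sketch shows a single-sample REINFORCE gradient estimate with a baseline, expressed with respect to the pre-softmax logits of each step. It assumes a softmax output layer, for which the gradient of the negative log-probability of the sampled word is the predicted distribution minus a one-hot vector; the function names are illustrative and the baseline values are passed in rather than learnt.

```python
def reinforce_logit_grads(step_probs, sampled_ids, reward, baselines):
    """Gradient of the sampled-sequence loss w.r.t. each step's logits.

    step_probs[t] is the softmax distribution at step t, sampled_ids[t] the
    sampled word index, and baselines[t] the predicted reward at step t.
    """
    grads = []
    for probs, y, b in zip(step_probs, sampled_ids, baselines):
        advantage = reward - b
        # d(-log p[y]) / d(logits) = probs - onehot(y), scaled by the advantage.
        grads.append([(p - (1.0 if k == y else 0.0)) * advantage
                      for k, p in enumerate(probs)])
    return grads


def baseline_mse(baselines, reward):
    """Mean squared error used to fit the baseline predictor."""
    return sum((reward - b) ** 2 for b in baselines) / len(baselines)
```

With a positive advantage, a descent step on these gradients raises the logit of the sampled word; with a negative advantage it lowers it.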

Document Level Curriculum Learning

Rather than starting from a state given by a random policy, we initialise the agent with a policy learnt by pre-training with the negative log-likelihood objective Ranzato et al. (2016); Zhang and Lapata (2017). The reinforcement learning objective is applied gradually in combination with the log-likelihood objective on each target block subsequence. Recall from Section 4.1 that our document is segmented into blocks of equal size during training which we denote as MaxBlock. When training begins, only the last $\Delta$ tokens are predicted by the agent while for the first $(\mathit{MaxBlock} - \Delta)$ tokens we still use the negative log-likelihood objective. The number of tokens $\Delta$ predicted by the agent is incremented by a fixed number of units every 2 epochs, and the training ends when $(\mathit{MaxBlock} - \Delta) = 0$. Since we evaluate the model’s predictions at the block level, the reward function is also evaluated at the block level.
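The schedule can be sketched as a function of the epoch. The sketch assumes that the initial number of agent-predicted tokens equals the increment, and the `step` argument is a placeholder since the increment value is not given here.

```python
def rl_token_count(epoch, step, max_block, epochs_per_increment=2):
    """Number of trailing tokens per block handled by the agent at a 0-based epoch.

    Assumes the count starts at `step` and grows by `step` every
    `epochs_per_increment` epochs until it covers the whole block.
    """
    increments = epoch // epochs_per_increment + 1
    return min(max_block, increments * step)


def split_block(block, n_rl):
    """Split a block into the likelihood-trained prefix and the agent-predicted suffix."""
    cut = max(0, len(block) - n_rl)
    return block[:cut], block[cut:]
```

Once `rl_token_count` reaches the block size, the prefix returned by `split_block` is empty and every token in the block is produced by the agent.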

5 Experimental Setup


We evaluated our model on a dataset collated from WikiBio Lebret et al. (2016), a corpus of 728,321 biography articles (their first paragraph) and their infoboxes sampled from the English Wikipedia. We adapted the original dataset in three ways. Firstly, we make use of the entire abstract rather than only the first sentence. Secondly, we reduced the dataset to examples with a rich set of properties and multi-sentential text. We eliminated examples with fewer than six property-value pairs and abstracts consisting of one sentence. We also imposed a minimum length of 23 words on the abstract. We considered abstracts up to a maximum of 12 sentences and property sets with a maximum of 50 property-value pairs. Finally, we associated each abstract with the set of DBPedia properties $p : v$ corresponding to the abstract’s main entity. As entity classification is available in DBPedia for most entities, we concatenate class information $c$ (whenever available) with the property value, i.e., $p : v\,c$. In Figure 1, for instance, a property value is extended in this way with class information from the DBPedia ontology.
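The size constraints above can be expressed as a simple predicate. This sketch assumes that examples exceeding the maxima are discarded (rather than truncated) and that words are counted by whitespace splitting; both are assumptions, as is the function name.

```python
def keep_example(properties, sentences):
    """Return True if an infobox/abstract pair satisfies the size constraints.

    properties: mapping from property name to value.
    sentences: list of sentence strings making up the abstract.
    """
    n_words = sum(len(s.split()) for s in sentences)
    return (
        6 <= len(properties) <= 50      # at least six, at most 50 property-value pairs
        and 2 <= len(sentences) <= 12   # multi-sentence, at most 12 sentences
        and n_words >= 23               # minimum abstract length in words
    )
```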


Numeric date formats were converted to a surface form with month names. Numerical expressions were delexicalised using different tokens created with the property name and position of the delexicalised token on the value sequence. For instance, given the property-value for birth date in Figure (1a), the first sentence in the abstract (Figure (1b)) becomes “Robert Joseph Flaherty, (February DLX_birth_date_2, DLX_birth_date_4 – July …”. Years and numbers in the text not found in the values of the property set were replaced with tokens YEAR and NUMERIC. (Footnote 2: We exploit these tokens to further adjust the score of the reward function given by Equation (16). Each time the predicted output contains some of these symbols we decrease the reward score by a penalty which we empirically set to 0.025.) In a second phase, when creating the input and output vocabularies, we delexicalised words which were absent from the output vocabulary but were attested in the input vocabulary. Again, we created tokens based on the property name and the position of the word in the value sequence. Words in neither vocabulary were replaced with the symbol UNK. Vocabulary sizes were limited, with separate limits for the alignment model and for the generator. We discarded examples where the text contained more than three UNKs (for the content aligner) and five UNKs (for the generator); or more than two UNKs in the property-value (for generation). Finally, we added the empty relation to the property sets.
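The numeric delexicalisation step can be sketched as follows. The sketch assumes whitespace-tokenised input, 1-based positions within the value sequence (which matches the DLX_birth_date_2 and DLX_birth_date_4 tokens in the example), and that four-digit numbers are treated as years; the function names and the year heuristic are assumptions.

```python
def delexicalise_numbers(text_tokens, prop_name, value_tokens):
    """Replace numbers that occur in a property value with DLX_<property>_<position>."""
    slot = prop_name.replace(" ", "_")
    positions = {}
    for pos, tok in enumerate(value_tokens, start=1):
        if tok.isdigit() and tok not in positions:
            positions[tok] = pos
    return [f"DLX_{slot}_{positions[t]}" if t in positions else t for t in text_tokens]


def mask_remaining_numbers(tokens):
    """Replace numbers not covered by any property value with YEAR or NUMERIC."""
    out = []
    for tok in tokens:
        if tok.isdigit():
            # Treating four-digit numbers as years is a simplifying assumption.
            out.append("YEAR" if len(tok) == 4 else "NUMERIC")
        else:
            out.append(tok)
    return out
```

Applying `delexicalise_numbers` for every property first and `mask_remaining_numbers` afterwards leaves only unsupported numbers as YEAR or NUMERIC, which is what allows the reward to penalise them.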

Table 2 summarises the dataset statistics for the generator. We report the number of abstracts in the dataset (size), the average number of sentences and tokens in the abstracts, and the average number of properties and sentence length in tokens (sent.len). For the content aligner (cf. Section 3), each sentence constitutes a training instance, and as a result the sizes of the train and development sets are 796,446 and 153,096, respectively.

generation   train          dev            test
size         165,324        25,399         23,162
sentences    3.51±1.99      3.46±1.94      3.22±1.72
tokens       74.13±43.72    72.85±42.54    66.81±38.16
properties   14.97±8.82     14.96±8.85     21.6±9.97
sent.len     21.06±8.87     21.03±8.85     20.77±8.74
Table 2: Dataset statistics.

Training Configuration

We adjusted all models’ hyperparameters according to their performance on the development set. The encoders for both content selection and generation models were initialised with GloVe Pennington et al. (2014) pre-trained vectors. The input and hidden unit dimension was set to 200 for content selection and 100 for generation. In all models, we used encoder biLSTMs and decoder LSTM (regularised with a dropout rate of 0.3 Zaremba et al. (2014)) with one layer. Content selection and generation models (base encoder-decoder and MTL) were trained for 20 epochs with the ADAM optimiser Kingma and Ba (2014) using a learning rate of 0.001. The reinforcement learning model was initialised with the base encoder-decoder model and trained for 35 additional epochs with stochastic gradient descent and a fixed learning rate of 0.001. Block sizes were set to 40 (base), 60 (MTL) and 50 (RL). Weights for the MTL objective were also tuned experimentally; we used one setting for the first four epochs (training focuses on alignment prediction) and switched to a different one for the remaining epochs.

Content Alignment

We optimized content alignment on the development set against manual alignments. Specifically, two annotators aligned 132 sentences to their infoboxes. We used the Yawat annotation tool Germann (2008) and followed the alignment guidelines (and evaluation metrics) used in Cohn et al. (2008). The inter-annotator agreement using macro-averaged f-score was 0.72 (we treated one annotator as the reference and the other one as hypothetical system output).

Alignment sets were extracted from the model’s output (cf. Section 3) by optimizing a threshold on $S$, where $S$ denotes the similarity between the set of property values and words. The threshold is defined in terms of the mean and standard deviation of $S$ scores across the development set, with a coefficient empirically set to 0.75. Each word was aligned to a property-value if their similarity exceeded a threshold of 0.22. Our best content alignment model (Content-Aligner) obtained an f-score of 0.36 on the development set.
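Alignment extraction by thresholding can be sketched as below. The sketch assumes that each word is paired with its best scoring property-value under the dot product and kept only if that score exceeds the threshold; whether the model's raw scores are normalised before thresholding is not specified here, so the function and its inputs are illustrative.

```python
def extract_alignments(words, word_vecs, props, prop_vecs, threshold=0.22):
    """Return (property, word) pairs whose best similarity exceeds the threshold."""
    aligned = []
    for word, w in zip(words, word_vecs):
        scores = [sum(a * b for a, b in zip(p, w)) for p in prop_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            aligned.append((props[best], word))
    return aligned
```

Words whose best score stays below the threshold are left unaligned, which is how unsupported spans end up outside the alignment set.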

We also compared our Content-Aligner against a baseline based on pre-trained word embeddings (EmbeddingsBL). For each pair $(\mathcal{P}, s)$ we computed the dot product between words in $s$ and properties in $\mathcal{P}$ (properties were represented by the averaged sum of their words’ vectors). Words were aligned to property-values if their similarity exceeded a threshold of 0.4. EmbeddingsBL obtained an f-score of 0.057 against the manual alignments. Finally, we compared the performance of the Content-Aligner at the level of property set and sentence similarity by comparing the average ranking position of correct pairs among 14 distractors, namely rank@15. The Content-Aligner obtained a rank of 1.31, while the EmbeddingsBL model had a rank of 7.99 (lower is better).
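The rank@15 measure can be sketched as follows, assuming each evaluation item consists of the similarity score of the correct pair and the scores of its 14 distractors, and that ties are resolved in favour of the correct pair. The function names are illustrative.

```python
def rank_of_correct(correct_score, distractor_scores):
    """1-based rank of the correct pair among its distractors (ties favour the correct pair)."""
    return 1 + sum(1 for s in distractor_scores if s > correct_score)


def mean_rank(examples):
    """Average rank over (correct_score, distractor_scores) items."""
    ranks = [rank_of_correct(c, d) for c, d in examples]
    return sum(ranks) / len(ranks)
```

A mean rank close to 1 means the correct sentence is almost always the top scoring candidate for its property set.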

6 Results

We compared the performance of an encoder-decoder model trained with the standard negative log-likelihood method (ED), against a model trained with multi-task learning (ED_MTL) and reinforcement learning (ED_RL). We also included a template baseline system (Templ) in our evaluation experiments.

The template generator used hand-written rules to realise property-value pairs. As an approximation for content selection, we obtained the 50 most frequent property names from the training set and manually defined content ordering rules with the following criteria. We ordered personal life properties based on their most common order of mention in the Wikipedia abstracts. Profession dependent properties were assigned an equal ordering but posterior to the personal properties. We manually lexicalised properties into single sentence templates to be concatenated to produce the final text. For example, the template for a playing-position property and its verbalisation for the entity zanetti are “NAME played as POSITION.” and “Zanetti played as defender.”, respectively.
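A template baseline of this kind can be sketched in a few lines. The `position` template is the one quoted above; the `birth_place` template, the property names used as dictionary keys and the ordering list are hypothetical stand-ins for the hand-written rules.

```python
# Property name -> single-sentence template. Only the "position" template is
# quoted in the text; "birth_place" is a hypothetical addition.
TEMPLATES = {
    "birth_place": "NAME was born in BIRTH_PLACE.",
    "position": "NAME played as POSITION.",
}

# Personal-life properties first, profession-dependent ones afterwards.
ORDER = ["birth_place", "position"]


def realise(name, properties):
    """Concatenate one templated sentence per property value, in ORDER."""
    sentences = []
    for prop in ORDER:
        for value in properties.get(prop, []):
            sentence = TEMPLATES[prop].replace("NAME", name).replace(prop.upper(), value)
            sentences.append(sentence)
    return " ".join(sentences)
```

For instance, `realise("Zanetti", {"position": ["defender"]})` yields the verbalisation quoted above. Emitting one sentence per value also mirrors the repetitive style of the Templ outputs shown in the examples.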

Model     Abstract   RevAbs
Templ     5.47       6.43
ED        13.46      35.89
ED_MTL    13.57      37.18
ED_RL     12.97      35.74
Table 3: BLEU-4 results using the original Wikipedia abstract (Abstract) as reference and crowd-sourced revised abstracts (RevAbs) for template baseline (Templ), standard encoder-decoder model (ED), and our content-based models trained with multi-task learning (ED_MTL) and reinforcement learning (ED_RL).

Automatic Evaluation

Table 3 shows the results of automatic evaluation using BLEU-4 Papineni et al. (2002) against the noisy Wikipedia abstracts. Considering these as a gold standard is, however, not entirely satisfactory for two reasons. Firstly, our models generate considerably shorter text and will be penalized for not generating text they were not supposed to generate in the first place. Secondly, the model might try to reproduce what is in the imperfect reference but not supported by the input properties and as a result will be rewarded when it should not. To alleviate this, we used AMT to crowd-source revised versions of 200 randomly selected abstracts from the test set.

Crowdworkers were shown a Wikipedia infobox with the accompanying abstract and were asked to adjust the text to the content present in the infobox. Annotators were instructed to delete spans which did not have supporting facts and rewrite the remaining parts into a well-formed text. We collected three revised versions for each abstract. Inter-annotator agreement was 81.64 measured as the mean pairwise BLEU-4 amongst AMT workers.

Automatic evaluation results against the revised abstracts are also shown in Table 3. As can be seen, all encoder-decoder based models have a significant advantage over Templ when evaluating against both types of abstracts. The model enabled with the multi-task learning content selection mechanism brings an improvement of 1.29 BLEU-4 over a vanilla encoder-decoder model. Performance of the RL trained model is inferior and close to the ED model. We discuss the reasons for this discrepancy shortly.

To provide a rough comparison with the results reported in Lebret et al. (2016), we also computed BLEU-4 on the first sentence of the text generated by our system. (Footnote 3: We post-processed system output with Stanford CoreNLP Manning et al. (2014) to extract the first sentence.) Recall that their model generates the first sentence of the abstract, whereas we output multi-sentence text. Using the first sentence in the Wikipedia abstract as reference, we obtained a score of 37.29% (ED), 38.42% (ED_MTL) and 38.1% (ED_RL) which compare favourably with their best performing model (34.7% ± 0.36).

System    1       2       3       4       5       Rank
Templ     12.17   14.33   10.17   15.50   47.83   3.72
ED        12.83   24.17   24.67   25.17   13.17   3.02
ED_MTL    14.83   26.17   26.17   19.17   13.67   2.90
ED_RL     14.67   25.00   25.50   24.00   10.83   2.91
RevAbs    47.00   14.00   12.67   16.17   9.17    2.27
Table 4: Rankings shown as proportions and mean ranks given to systems by human subjects.
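As a worked check on how a mean rank relates to the ranking proportions, the sketch below computes the proportion-weighted average rank for a table row. It assumes the mean rank is this weighted average; the Templ and ED rows come out close to the reported values, while other rows do not reproduce exactly (for example, the RevAbs proportions do not sum to 100), so this should be read as an approximation rather than the evaluation script.

```python
def mean_rank_from_proportions(proportions):
    """Weighted average rank, with proportions[k] the share of rank k + 1."""
    total = sum(proportions)
    return sum(rank * p for rank, p in enumerate(proportions, start=1)) / total


templ = [12.17, 14.33, 10.17, 15.50, 47.83]
ed = [12.83, 24.17, 24.67, 25.17, 13.17]
print(round(mean_rank_from_proportions(templ), 2))
print(round(mean_rank_from_proportions(ed), 2))
```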
 property-set name= dorsey burnette, date= may 2012, bot= blevintron bot, background= solo singer, birth= december 28 , 1932, birth place= memphis, tennessee, death place= {los angeles; canoga park, california}, death= august 19 , 1979, associated acts= the rock and roll trio, hometown= memphis, tennessee, genre= {rock and roll; rockabilly; country music}, occupation= {composer; singer}, instruments= {rockabilly bass; vocals; acoustic guitar}, record labels= {era records; coral records; smash records; imperial records; capitol records; dot records; reprise records}
 RevAbs Dorsey Burnette (December 28 , 1932 – August 19 , 1979) was an american early Rockabilly singer. He was a member of the Rock and Roll Trio.
 Templ Dorsey Burnette (DB) was born in December 28 , 1932. DB was born in Memphis, Tennessee. DB died in August 19 , 1979. DB died in August 19 , 1979. DB died in Canoga Park, California. DB died in los angeles. DB was a composer. DB was a singer. DB ’s genre was Rock and Roll. The background of DB was solo singer. DB ’s genre was Rockabilly. DB worked with the Rock and Roll Trio. DB ’s genre was Country music. DB worked with the Rock and Roll Trio.
 ED Dorsey Burnette (December 28 , 1932 – August 19 , 1979) was an american singer and songwriter. He was a member of the Rock band the band from YEAR to YEAR.
 ED_MTL Dorothy Burnette (December 28 , 1932 – August 19 , 1979) was an american country music singer and songwriter. He was a member of the Rock band Roll.
 ED_RL Burnette Burnette (December 28 , 1932 – August 19 , 1979) was an american singer and songwriter. He was born in memphis , Tennessee.
 property-set name= indrani bose, doctoral advisor= chanchal kumar majumdar, alma mater= university of calcutta, birth= 1951-0-0, birth place= kolkata, field= theoretical physics, work institution= bose institute, birth= august 15 , 1951, honours= fna sc, nationality= india, known for= first recipient of stree sakthi science samman award
 RevAbs Indrani Bose (born 1951) is an indian physicist at the Bose institute. Professor Bose obtained her ph.d. from University of Calcutta
 Templ Indrani Bose (IB) was born in year-0-0. IB was born in August 15 , 1951. IB was born in kolkata. IB was a india. IB studied at University of Calcutta. IB was known for First recipient of Stree Sakthi Science Samman Award.
 ED Indrani UNK (born 15 August 1951) is an indian Theoretical physicist and Theoretical physicist. She is the founder and ceo of UNK UNK.
 ED_MTL Indrani Bose (born 15 August 1951) is an indian Theoretical physicist. She is a member of the UNK Institute of Science and technology.
 ED_RL Indrani UNK (born 15 August 1951) is an indian Theoretical physicist. She is a member of the Institute of technology ( UNK ).
 property-set name= aaron moores, coach= sarah paton, club= trowbridge asc, birth= may 16 , 1994, birth place= trowbridge, sport= swimming, paralympics= 2012
 RevAbs Aaron Moores (born 16 May 1994) is a british Paralympic swimmer competing in the s14 category , mainly in the backstroke and breaststroke and after qualifying for the 2012 Summer Paralympics he won a Silver medal in the 100 m backstroke.
 Templ Aaron Moores (AM) was born in May 16 , 1994. AM was born in May 16 , 1994. AM was born in Trowbridge.
 ED Donald Moores (born 16 May 1994) is a Paralympic swimmer from the United states. He has competed in the Paralympic Games.
 ED_MTL Donald Moores (born 16 May 1994) is an english swimmer. He competed at the 2012 Summer Paralympics.
 ED_RL Donald Moores (born 16 May 1994) is a Paralympic swimmer from the United states. He competed at the dlx_updated_3 Summer Paralympics.
 property-set name= kirill moryganov, height= 183.0, birth= february 7 , 1991, position= defender, height= 1.83, goals= {0; 1}, clubs= fc torpedo moscow, pcupdate= may 28 , 2016, years= {2013; 2012; 2015; 2016; 2010; 2014; 2008; 2009}, team= {fc neftekhimik nizhnekamsk; fc znamya truda orekhovo- zuyevo; fc irtysh omsk; fc vologda; fc torpedo-zil moscow; fc tekstilshchik ivanovo; fc khimki; fc oktan perm, fc ryazan, fc amkar perm}, matches= {16; 10; 3; 4; 9; 0; 30; 7; 15}
 RevAbs Kirill Andreyevich Moryganov (; born 7 February 1991) is a russian professional football player. He plays for fc Irtysh Omsk. He is a Central defender.
 Templ Kirill Moryganov (KM) was born in February 7 , 1991. KM was born in February 7 , 1991. The years of KM was 2013. The years of KM was 2013. KM played for fc Neftekhimik Nizhnekamsk. KM played for fc Znamya Truda Orekhovo- zuyevo. KM scored 1 goals. The years of KM was 2013. KM played for fc Irtysh Omsk. The years of KM was 2013. KM played as Defender. KM played for fc Vologda. KM played for fc Torpedo-zil Moscow. KM played for fc Tekstilshchik Ivanovo. KM scored 1 goals. KM ’s Club was fc Torpedo Moscow. KM played for fc Khimki. The years of KM was 2013. The years of KM was 2013. The years of KM was 2013. KM played for fc Amkar Perm. The years of KM was 2013. KM played for fc Ryazan. KM played for fc Oktan Perm.
 ED Kirill mikhailovich Moryganov (; born February 7 , 1991) is a russian professional football player. He last played for fc Torpedo armavir.
 ED_MTL Kirill Moryganov (; born 7 February 1991) is an english professional footballer who plays as a Defender. He plays for fc Neftekhimik Nizhnekamsk.
 ED_RL Kirill viktorovich Moryganov (; born February 7 , 1991) is a russian professional football player. He last played for fc Tekstilshchik Ivanovo.
Table 5: Examples of system output.

Human-Based Evaluation

We further examined differences among systems in a human-based evaluation study. Using AMT, we elicited 3 judgements for the same 200 infobox-abstract pairs we used in the abstract revision study. We compared the output of the templates, the three neural generators and also included one of the human edited abstracts as a gold standard (reference). For each test case, we showed crowdworkers the Wikipedia infobox and five short texts in random order. The annotators were asked to rank each of the texts according to the following criteria: (1) Is the text faithful to the content of the table? and (2) Is the text overall comprehensible and fluent? Ties were allowed only when texts were identical strings. Table 5 presents examples of the texts (and properties) crowdworkers saw.

Table 4 shows, proportionally, how often crowdworkers ranked each system first, second, and so on. Unsurprisingly, the human authored gold text is considered best (and ranked first 47% of the time). ED_MTL is mostly ranked second and third best, followed closely by ED_RL. The vanilla encoder-decoder system ED is mostly fourth and Templ is fifth. As shown in the last column of the table (Rank), the ranking of ED_MTL is overall slightly better than ED_RL. We further converted the ranks to ratings on a scale of 1 to 5 (assigning ratings 5…1 to rank placements 1…5). This allowed us to perform Analysis of Variance (ANOVA) which revealed a reliable effect of system type. Post-hoc Tukey tests showed that all systems were significantly worse than RevAbs and significantly better than Templ (p < 0.05). ED_MTL is not significantly better than ED_RL but is significantly (p < 0.05) different from ED.
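As a rough check, the Rank column of Table 4 and the rank-to-rating conversion can be approximated from the published proportions alone. The sketch below assumes that the mean rank is the proportion-weighted average of rank positions and that the conversion maps rank 1 to rating 5 and rank 5 to rating 1; the function names are hypothetical, and the three encoder-decoder rows are keyed in the order in which they appear in the table. Because only rounded proportions are available, the recomputed values match the reported ones to within about 0.02 rather than exactly.

```python
# Proportions (in %) of 1st..5th placements, copied from Table 4.
# The three encoder-decoder rows are keyed by their order in the table.
TABLE4 = {
    "Templ":      [12.17, 14.33, 10.17, 15.50, 47.83],
    "ED (row 1)": [12.83, 24.17, 24.67, 25.17, 13.17],
    "ED (row 2)": [14.83, 26.17, 26.17, 19.17, 13.67],
    "ED (row 3)": [14.67, 25.00, 25.50, 24.00, 10.83],
    "RevAbs":     [47.00, 14.00, 12.67, 16.17, 9.17],
}


def mean_rank(proportions):
    """Proportion-weighted average of the rank positions 1..5."""
    weighted = sum(rank * p for rank, p in enumerate(proportions, start=1))
    # Normalise by the actual total, since rounded proportions need not sum to 100.
    return weighted / sum(proportions)


def rank_to_rating(rank):
    """Map rank placements 1..5 onto ratings 5..1 (assumed linear inversion)."""
    return 6 - rank


def mean_rating(proportions):
    """Mean rating implied by the same proportions; equals 6 - mean rank."""
    weighted = sum(rank_to_rating(rank) * p
                   for rank, p in enumerate(proportions, start=1))
    return weighted / sum(proportions)


for system, proportions in TABLE4.items():
    print(f"{system:<12} rank={mean_rank(proportions):.2f} "
          f"rating={mean_rating(proportions):.2f}")
```

The significance tests themselves (ANOVA followed by post-hoc Tukey tests) need the individual judgements rather than these aggregated proportions, so they are not reproduced here.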


The texts generated by ED_RL are shorter compared to the other two neural systems, which might affect BLEU-4 scores and also the ratings provided by the annotators. As shown in Table 5 (entity dorsey burnette), ED_RL drops information pertaining to dates or chooses to just verbalise birth place information. In some cases, this is preferable to hallucinating incorrect facts; however, in other cases outputs with more information are rated more favourably. Overall, ED_MTL seems to be more detail oriented and faithful to the facts included in the infobox (see dorsey burnette, aaron moores, or kirill moryganov). The template system manages in some specific configurations to verbalise appropriate facts (indrani bose); however, it often fails to verbalise infrequent properties (aaron moores) or focuses on properties which are very frequent in the knowledge base but are rarely found in the abstracts (kirill moryganov).

7 Conclusions

In this paper we focused on the task of bootstrapping generators from large-scale datasets consisting of DBPedia facts and related Wikipedia biography abstracts. We proposed to equip standard encoder-decoder models with an additional content selection mechanism based on multi-instance learning and developed two training regimes, one based on multi-task learning and the other on reinforcement learning. Overall, we find that the proposed content selection mechanism improves the accuracy and fluency of the generated texts. In the future, it would be interesting to investigate a more sophisticated representation of the input Vinyals et al. (2016). It would also make sense for the model to decode hierarchically, taking sequences of words and sentences into account Zhang and Lapata (2014); Lebret et al. (2015).


Acknowledgments

We thank the NAACL reviewers for their constructive feedback. We also thank Xingxing Zhang, Li Dong and Stefanos Angelidis for useful discussions about implementation details. We gratefully acknowledge the financial support of the European Research Council (award number 681760).


References

  • Andrews and Hofmann (2004) Stuart Andrews and Thomas Hofmann. 2004. Multiple instance learning via disjunctive programming boosting. In Advances in Neural Information Processing Systems 16, Curran Associates, Inc., pages 65–72.
  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBPedia: A nucleus for a web of open data. The Semantic Web pages 722–735.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations. San Diego, CA.
  • Barzilay and Lapata (2005) Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada, pages 331–338.
  • Belz and Kow (2010) Anja Belz and Eric Kow. 2010. Extracting parallel fragments from comparable corpora for data-to-text generation. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics, Ireland, pages 167–171.
  • Bjerva et al. (2016) Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 3531–3541.
  • Carbonetto et al. (2008) Peter Carbonetto, Gyuri Dorkó, Cordelia Schmid, Hendrik Kück, and Nando De Freitas. 2008. Learning to recognize objects with little supervision. International Journal of Computer Vision.
  • Caruana (1993) Richard Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the 10th International Conference on Machine Learning. Morgan Kaufmann, pages 41–48.
  • Cheng et al. (2015) Hao Cheng, Hao Fang, and Mari Ostendorf. 2015. Open-domain name error detection using a multi-task RNN. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 737–746.
  • Chisholm et al. (2017) Andrew Chisholm, Will Radford, and Ben Hachey. 2017. Learning to generate one-sentence biographies from Wikidata. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain, pages 633–642.
  • Cohn et al. (2008) Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics 34(4):597–614.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
  • Cour et al. (2011) Timothee Cour, Ben Sapp, and Ben Taskar. 2011. Learning from partial labels. Journal of Machine Learning Research 12(May):1501–1536.
  • Germann (2008) Ulrich Germann. 2008. Yawat: yet another word alignment tool. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session. Columbus, Ohio, pages 20–23.
  • Ghader and Monz (2017) Hamidreza Ghader and Christof Monz. 2017. What does attention in neural machine translation pay attention to? In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, pages 30–39.
  • Hoffmann et al. (2011) Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Portland, Oregon, USA, pages 541–550.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 3128–3137.
  • Keeler and Rumelhart (1992) Jim Keeler and David E Rumelhart. 1992. A self-organizing integrated segmentation and recognition neural net. In Advances in Neural Information Processing Systems 5. Curran Associates, Inc., pages 496–503.
  • Kim and Mooney (2010) Joohyun Kim and Raymond J. Mooney. 2010. Generative alignment and semantic parsing for learning from ambiguous supervision. In Proceedings of the 23rd International Conference on Computational Linguistics. Beijing, China, pages 543–551.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 1203–1213.
  • Lebret et al. (2015) Rémi Lebret, Pedro O Pinheiro, and Ronan Collobert. 2015. Phrase-based image captioning. arXiv preprint arXiv:1502.03671.
  • Liang et al. (2009) Percy Liang, Michael Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore, pages 91–99.
  • Liu et al. (2017) Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2017. Table-to-text generation by structure-aware seq2seq learning. arXiv preprint arXiv:1711.09724.
  • Luong et al. (2016) Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the International Conference on Learning Representations. San Juan, Puerto Rico.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 1412–1421.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, Maryland, pages 55–60.
  • Maron and Ratan (1998) Oded Maron and Aparna Lakshmi Ratan. 1998. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning. San Francisco, California, USA, volume 98, pages 341–349.
  • Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, pages 720–730.
  • Mrabet et al. (2016) Yassine Mrabet, Pavlos Vougiouklis, Halil Kilicoglu, Claire Gardent, Dina Demner-Fushman, Jonathon Hare, and Elena Simperl. 2016. Aligning texts and knowledge bases with semantic sentence simplification. In Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web. Association for Computational Linguistics, pages 29–36.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA, pages 311–318.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 1532–1543.
  • Perez-Beltrachini and Gardent (2016) Laura Perez-Beltrachini and Claire Gardent. 2016. Learning Embeddings to lexicalise RDF Properties. In Proceedings of the 5th Joint Conference on Lexical and Computational Semantics. Berlin, Germany, pages 219–228.
  • Plank (2016) Barbara Plank. 2016. Keystroke dynamics as signal for shallow syntactic parsing. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 609–619.
  • Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations. San Juan, Puerto Rico.
  • Sha et al. (2017) Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. 2017. Order-planning neural text generation from structured data. arXiv preprint arXiv:1709.00155.
  • Søgaard and Goldberg (2016) Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany, pages 231–235.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pages 3104–3112.
  • Vinyals et al. (2016) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In Proceedings of the International Conference on Learning Representations. San Juan, Puerto Rico.
  • Walter et al. (2013) Sebastian Walter, Christina Unger, and Philipp Cimiano. 2013. A corpus-based approach for the induction of ontology lexica. In International Conference on Application of Natural Language to Information Systems. Springer, pages 102–113.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
  • Wiseman et al. (2017) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 2243–2253.
  • Wu et al. (2015) Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, Massachusetts, USA, pages 3460–3469.
  • Xu et al. (2014) Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji. 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics 2:435–448.
  • Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR abs/1409.2329.
  • Zhang et al. (2002) Qi Zhang, Sally A Goldman, Wei Yu, and Jason E Fritts. 2002. Content-based image retrieval using multiple-instance learning. In Proceedings of the 19th International Conference on Machine Learning. Sydney, Australia, volume 2, pages 682–689.
  • Zhang and Lapata (2014) Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 670–680.
  • Zhang and Lapata (2017) Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 595–605.