Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task, and investigate how effective current approaches are on this task. In particular, we introduce a new, large-scale corpus of data records paired with descriptive documents, propose a series of extractive evaluation methods for analyzing performance, and obtain baseline results using current neural generation methods. Experiments show that these models produce fluent text, but fail to convincingly approximate human-generated documents. Moreover, even templated baselines exceed the performance of these neural models on some metrics, though copy- and reconstruction-based extensions lead to noticeable improvements.
Over the past several years, neural text generation systems have shown impressive performance on tasks such as machine translation and summarization. As neural systems begin to move toward generating longer outputs in response to longer and more complicated inputs, however, the generated texts begin to display reference errors, inter-sentence incoherence, and a lack of fidelity to the source material. The goal of this paper is to suggest a particular, long-form generation task in which these challenges may be fruitfully explored, to provide a publicly available dataset for this task, to suggest some automatic evaluation metrics, and finally to establish how current neural text generation methods perform on this task.
A classic problem in natural-language generation (NLG) (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) involves taking structured data, such as a table, as input, and producing text that adequately and fluently describes this data as output. Unlike machine translation, which aims for a complete transduction of the sentence to be translated, this form of NLG is typically taken to require addressing (at least) two separate challenges: what to say, the selection of an appropriate subset of the input data to discuss, and how to say it, the surface realization of a generation (Reiter and Dale, 1997; Jurafsky and Martin, 2014). Traditionally, these two challenges have been modularized and handled separately by generation systems. However, neural generation systems, which are typically trained end-to-end as conditional language models (Mikolov et al., 2010; Sutskever et al., 2011, 2014), blur this distinction.
In this context, we believe the problem of generating multi-sentence summaries of tables or database records to be a reasonable next problem for neural techniques to tackle as they begin to consider more difficult NLG tasks. In particular, we would like this generation task to have the following two properties: (1) it is relatively easy to obtain fairly clean summaries and their corresponding databases for dataset construction, and (2) the summaries should be primarily focused on conveying the information in the database. This latter property ensures that the task is somewhat congenial to a standard encoder-decoder approach, and, more importantly, that it is reasonable to evaluate generations in terms of their fidelity to the database.
One task that meets these criteria is that of generating summaries of sports games from associated box-score data, and there is indeed a long history of NLG work that generates sports game summaries (Robin, 1994; Tanaka-Ishii et al., 1998; Barzilay and Lapata, 2005). To this end, we make the following contributions:
We introduce a new large-scale corpus consisting of textual descriptions of basketball games paired with extensive statistical tables. This dataset is sufficiently large that fully data-driven approaches might be sufficient.
We introduce a series of extractive evaluation models to automatically evaluate output generation performance, exploiting the fact that post-hoc information extraction is significantly easier than generation itself.
We apply a series of state-of-the-art neural methods, as well as a simple templated generation system, to our data-to-document generation task in order to establish baselines and study their generations.
Our experiments indicate that neural systems are quite good at producing fluent outputs and generally score well on standard word-match metrics, but perform quite poorly at content selection and at capturing long-term structure. While the use of copy-based models and additional reconstruction terms in the training loss can lead to improvements in BLEU and in our proposed extractive evaluations, current models are still quite far from producing human-level output, and are significantly worse than templated systems in terms of content selection and realization. Overall, we believe this problem of data-to-document generation highlights important remaining challenges in neural generation systems, and the use of extractive evaluation reveals significant issues hidden by standard automatic metrics.
We consider the problem of generating descriptive text from database records. Following the notation in Liang et al. (2009), let $\mathbf{s} = \{r_j\}_{j=1}^{J}$ be a set of records, where for each $r \in \mathbf{s}$ we define $r.t \in \mathcal{T}$ to be the type of $r$, and we assume each $r$ to be a binarized relation, where $r.e$ and $r.m$ are a record’s entity and value, respectively. For example, a database recording statistics for a basketball game might have a record $r$ such that $r.t$ = points, $r.e$ = Russell Westbrook, and $r.m$ = 50. In this case, $r.e$ gives the player in question, and $r.m$ gives the number of points the player scored. From these records, we are interested in generating descriptive text $\hat{y}_{1:T} = \hat{y}_1, \ldots, \hat{y}_T$ of $T$ words such that $\hat{y}_{1:T}$ is an adequate and fluent summary of $\mathbf{s}$. A dataset for training data-to-document systems typically consists of $(\mathbf{s}, y_{1:T})$ pairs, where $y_{1:T}$ is a document consisting of a gold (i.e., human generated) summary for database $\mathbf{s}$.
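As a concrete illustration of this setup, the following minimal sketch represents a record as a (type, entity, value) triple and pairs a small record set with a gold summary. The class name `Record`, its field names, and the second record's value are assumptions made for illustration only; they are not taken from the dataset release.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Record:
    """One binarized relation: a type, an entity, and a value."""
    type: str                # the record type, e.g. "points"
    entity: str              # the entity, e.g. a player or team name
    value: Union[int, str]   # the value, e.g. the number of points scored


# A toy record set for a single game. The first record mirrors the example
# in the text; the second is a hypothetical value added for illustration.
records: List[Record] = [
    Record(type="points", entity="Russell Westbrook", value=50),
    Record(type="rebounds", entity="Russell Westbrook", value=16),
]

# A training pair couples the record set with a tokenized gold summary.
gold_summary: List[str] = "Russell Westbrook scored 50 points .".split()
training_pair: Tuple[List[Record], List[str]] = (records, gold_summary)
```

Freezing the dataclass makes records hashable, which is convenient when sets of records are later compared.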
Several benchmark datasets have been used in recent years for the text generation task, the most popular of these being WeatherGov (Liang et al., 2009) and Robocup (Chen and Mooney, 2008). Recently, neural generation systems have shown strong results on these datasets, with the system of Mei et al. (2016) achieving BLEU scores in the 60s and 70s on WeatherGov, and BLEU scores of almost 30 even on the smaller Robocup dataset. These results are quite promising, and suggest that neural models are a good fit for text generation. However, the statistics of these datasets, shown in Table 1, indicate that these datasets use relatively simple language and record structure. Furthermore, there is reason to believe that WeatherGov is at least partially machine-generated (Reiter, 2017). More recently, Lebret et al. (2016) introduced the WikiBio dataset, which is at least an order of magnitude larger in terms of number of tokens and record types. However, as shown in Table 1, this dataset too only contains short (single-sentence) generations, and relatively few records per generation. As such, we believe that early success on these datasets is not yet sufficient for testing the desired linguistic capabilities of text generation at document scale.
With this challenge in mind, we introduce a new dataset for data-to-document text generation, available at https://github.com/harvardnlp/boxscore-data. The dataset is intended to be comparable to WeatherGov in terms of token count, but to have significantly longer target texts, a larger vocabulary space, and to require more difficult content selection.
The dataset consists of two sources of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables. The data statistics of these two sources, RotoWire and SBNation, are also shown in Table 1. The first dataset, RotoWire, uses professionally written, medium length game summaries targeted at fantasy basketball fans. The writing is colloquial, but relatively well structured, and targets an audience primarily interested in game statistics. The second dataset, SBNation, uses fan-written summaries targeted at other fans. This dataset is significantly larger, but also much more challenging, as the language is very informal, and often tangential to the statistics themselves. We show some sample text from RotoWire in Figure 1. Our primary focus will be on the RotoWire data.
We begin by discussing the evaluation of generated documents, since both the task we introduce and the evaluation methods we propose are motivated by some of the shortcomings of current approaches to evaluation. Text generation systems are typically evaluated using a combination of automatic measures, such as BLEU (Papineni et al., 2002), and human evaluation. While BLEU is perhaps a reasonably effective way of evaluating short-form text generation, we found it to be unsatisfactory for document generation. In particular, we note that it primarily rewards fluent text generation, rather than generations that capture the most important information in the database, or that report the information in a particularly coherent way. While human evaluation, on the other hand, is likely ultimately necessary for evaluating generations (Liu et al., 2016; Wu et al., 2016), it is much less convenient than using automatic metrics. Furthermore, we believe that current text generations are sufficiently bad in sufficiently obvious ways that automatic metrics can still be of use in evaluation, and we are not yet at the point of needing to rely solely on human evaluators.
To address this evaluation challenge, we begin with the intuition that assessing document quality is easier than document generation. In particular, it is much easier to automatically extract information from documents than to generate documents that accurately convey desired information. As such, simple, high-precision information extraction models can serve as the basis for assessing and better understanding the quality of automatic generations. We emphasize that such an evaluation scheme is most appropriate when evaluating generations (such as basketball game summaries) that are primarily intended to summarize information. While many generation problems do not fall into this category, we believe this to be an interesting category, and one worth focusing on because it is amenable to this sort of evaluation.
To see how a simple information extraction system might work, consider the document in Figure 1. We may first extract candidate entity (player, team, and city) and value (number and certain string) pairs that appear in the text, and then predict the type (or none) of each candidate pair. For example, we might extract the entity-value pair (“Miami Heat”, “95”) from the first sentence in Figure 1, and then predict that the type of this pair is points, giving us an extracted record $r$ such that $(r.e, r.m, r.t)$ = (Miami Heat, 95, points). Indeed, many relation extraction systems reduce relation extraction to multi-class classification precisely in this way (Zhang, 2004; Zhou et al., 2008; Zeng et al., 2014; dos Santos et al., 2015).
More concretely, given a document $\hat{y}_{1:T}$, we consider all pairs of word-spans in each sentence that represent possible entities $e$ and values $m$. We then model $p(r.t \mid e, m; \theta)$ for each pair, using $r.t = \epsilon$ to indicate unrelated pairs. We use architectures similar to those discussed in Collobert et al. (2011) and dos Santos et al. (2015) to parameterize this probability; full details are given in the Appendix.
Importantly, we note that the $(\mathbf{s}, y_{1:T})$ pairs typically used for training data-to-document systems are also sufficient for training the information extraction model presented above, since we can obtain (partial) supervision by simply checking whether a candidate record lexically matches a record in $\mathbf{s}$. (Footnote 1: Alternative approaches explicitly align the document with the table for this task (Liang et al., 2009).) However, since there may be multiple records $r$ with the same $e$ and $m$ but with different types $r.t$, we will not always be able to determine the type of a given entity-value pair found in the text. We therefore train our classifier to minimize a latent-variable loss: for all document spans $e$ and $m$, with observed types $t(e, m) = \{r.t : r \in \mathbf{s}, r.e = e, r.m = m\}$ (possibly $\{\epsilon\}$), we minimize

$$\mathcal{L}(\theta) = -\sum_{e, m} \log \sum_{t' \in t(e, m)} p(r.t = t' \mid e, m; \theta).$$
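A latent-variable loss of this kind can be illustrated numerically: for each candidate entity-value pair, the probability mass assigned to every type compatible with the records is summed before the logarithm is taken, so the classifier is not penalized for spreading mass across types it cannot distinguish from the supervision. The sketch below is a minimal illustration; the function name `latent_type_loss`, the `"NONE"` label for unrelated pairs, and all probabilities are assumptions rather than details of the authors' extractor.

```python
import math
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (entity span, value span)


def latent_type_loss(
    predicted: Dict[Pair, Dict[str, float]],
    observed_types: Dict[Pair, Set[str]],
) -> float:
    """Sum over pairs of -log(total probability on record-compatible types)."""
    loss = 0.0
    for pair, type_probs in predicted.items():
        compatible = observed_types[pair]
        mass = sum(type_probs.get(t, 0.0) for t in compatible)
        loss -= math.log(mass)
    return loss


# Hypothetical classifier outputs for two candidate pairs.
predicted = {
    ("Miami Heat", "95"): {"points": 0.7, "rebounds": 0.1, "NONE": 0.2},
    ("LeBron James", "10"): {"points": 0.3, "rebounds": 0.3, "assists": 0.2, "NONE": 0.2},
}
# The second pair is ambiguous: in this invented example the records contain
# both 10 rebounds and 10 assists for the same player.
observed_types = {
    ("Miami Heat", "95"): {"points"},
    ("LeBron James", "10"): {"rebounds", "assists"},
}

loss = latent_type_loss(predicted, observed_types)
```

For the ambiguous pair, the loss depends only on the combined mass placed on "rebounds" and "assists", not on how that mass is divided between them.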
We find that this simple system trained in this way is quite accurate at predicting relations. On the RotoWire data it achieves over 90% accuracy on held-out data, and recalls approximately 60% of the relations licensed by the records.
With a sufficiently precise relation extraction system, we can begin to evaluate how well an automatic generation $\hat{y}_{1:T}$ has captured the information in a set of records $\mathbf{s}$. In particular, since the predictions of a precise information extraction system serve to align entity-mention pairs in the text with database records, this alignment can be used both to evaluate a generation’s content selection (“what the generation says”), as well as content placement (“how the generation says it”).
We consider in particular three induced metrics:
Content Selection (CS): precision and recall of unique relations $r$ extracted from $\hat{y}_{1:T}$ that are also extracted from $y_{1:T}$. This measures how well the generated document matches the gold document in terms of selecting which records to generate.
Relation Generation (RG): precision and number of unique relations $r$ extracted from $\hat{y}_{1:T}$ that also appear in $\mathbf{s}$. This measures how well the system is able to generate text containing factual (i.e., correct) records.
Content Ordering (CO): normalized Damerau-Levenshtein Distance (DLD) (Brill and Moore, 2000) between the sequences of records extracted from $y_{1:T}$ and that extracted from $\hat{y}_{1:T}$. This measures how well the system orders the records it chooses to discuss. (Footnote 2: DLD is a variant of Levenshtein distance that allows transpositions of elements; it is useful in comparing the ordering of sequences that may not be permutations of the same set, which is a requirement for measures like Kendall’s Tau.)
We note that CS primarily targets the “what to say” aspect of evaluation, CO targets the “how to say it” aspect, and RG targets both.
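The three metrics can be sketched in a few lines once relations have been extracted. The code below is an illustrative sketch, not the released evaluation script: it uses the optimal-string-alignment variant of Damerau-Levenshtein distance, and it assumes the ordering score is one minus the distance divided by the longer sequence length, so that identical orderings score 1. All relations in the example are invented.

```python
from typing import Sequence, Set, Tuple

Rel = Tuple[str, str, str]  # (entity, value, type)


def precision_recall(pred: Set[Rel], ref: Set[Rel]) -> Tuple[float, float]:
    """Precision and recall of the unique relations in pred with respect to ref."""
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


def osa_distance(a: Sequence[Rel], b: Sequence[Rel]) -> int:
    """Optimal-string-alignment distance: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def content_ordering(gen: Sequence[Rel], gold: Sequence[Rel]) -> float:
    """1 - distance / longer length (an assumed normalization)."""
    longest = max(len(gen), len(gold))
    return 1.0 if longest == 0 else 1.0 - osa_distance(gen, gold) / longest


# Invented relations: A, B, C are in the records; D is not.
A = ("Heat", "95", "points")
B = ("LeBron James", "30", "points")
C = ("LeBron James", "8", "rebounds")
D = ("LeBron James", "12", "assists")
database = {A, B, C}
gold_rels = [A, B, C]   # extracted from the gold summary, in order
gen_rels = [B, A, D]    # extracted from the generated summary, in order

cs_precision, cs_recall = precision_recall(set(gen_rels), set(gold_rels))
rg_precision, _ = precision_recall(set(gen_rels), database)
rg_count = len(set(gen_rels) & database)
co_score = content_ordering(gen_rels, gold_rels)
```

In this toy case the generation swaps two relations and replaces a third with an unsupported one, which costs one transposition and one substitution in the ordering distance.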
We conclude this section by contrasting the automatic evaluation we have proposed with recently proposed adversarial evaluation approaches, which also advocate automatic metrics backed by classification (Bowman et al., 2016; Kannan and Vinyals, 2016; Li et al., 2017). Unlike adversarial evaluation, which uses a black-box classifier to determine the quality of a generation, our metrics are defined with respect to the predictions of an information extraction system. Accordingly, our metrics are quite interpretable, since by construction it is always possible to determine which fact (i.e., entity-value pair) in the generation is determined by the extractor to not match the database or the gold generation.
In this section we briefly describe the neural generation methods we apply to the proposed task. As a base model we utilize the now standard attention-based encoder-decoder model (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015). We also experiment with several recent extensions to this model, including copy-based generation, and training with a source reconstruction term in the loss (in addition to the standard per-target-word loss).
For our base model, we map each record $r \in \mathbf{s}$ into a vector $\tilde{\mathbf{r}}$ by first embedding $r.t$ (e.g., points), $r.e$ (e.g., Russell Westbrook), and $r.m$ (e.g., 50), and then applying a 1-layer MLP (similar to Yang et al. (2016)). (Footnote 3: We also include an additional feature for whether the player is on the home- or away-team.) Our source data-records are then represented as $\tilde{\mathbf{s}} = \{\tilde{\mathbf{r}}_j\}_{j=1}^{J}$. Given $\tilde{\mathbf{s}}$, we use an LSTM decoder with attention and input-feeding, in the style of Luong et al. (2015), to compute the probability of each target word, conditioned on the previous words and on $\mathbf{s}$. The model is trained end-to-end to minimize the negative log-likelihood of the words in the gold text $y_{1:T}$ given corresponding source material $\mathbf{s}$.
There has been a surge of recent work involving augmenting encoder-decoder models to copy words directly from the source material on which they condition (Gu et al., 2016; Gülçehre et al., 2016; Merity et al., 2016; Jia and Liang, 2016; Yang et al., 2016). These models typically introduce an additional binary variable $z_t$ into the per-timestep target word distribution, which indicates whether the target word $\hat{y}_t$ is copied from the source or generated:

$$p(\hat{y}_t \mid \hat{y}_{1:t-1}, \mathbf{s}) = \sum_{z \in \{0, 1\}} p(\hat{y}_t, z_t = z \mid \hat{y}_{1:t-1}, \mathbf{s}).$$

In our case, we assume that target words are copied from the value portion of a record $r$; that is, a copy implies $\hat{y}_t = r.m$ for some $r$ and $t$.
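Gülçehre et al. (2016), on the other hand, decompose the joint probability as:

$$p(\hat{y}_t, z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) = \begin{cases} p_{\text{copy}}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s}) \, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 1 \\ p_{\text{gen}}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s}) \, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 0, \end{cases}$$

where an MLP is used to model $p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s})$.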
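To make the conditional decomposition concrete, the sketch below computes the marginal probability of a target word by summing the copy branch and the generate branch, each weighted by the probability of the binary switch. The function name `word_probability` and every probability value are hypothetical; in a real model these quantities would come from the decoder state and attention.

```python
from typing import Dict


def word_probability(
    word: str,
    p_switch_copy: float,
    p_copy: Dict[str, float],
    p_gen: Dict[str, float],
) -> float:
    """Marginal word probability, summing over copy (z=1) and generate (z=0)."""
    copied = p_switch_copy * p_copy.get(word, 0.0)
    generated = (1.0 - p_switch_copy) * p_gen.get(word, 0.0)
    return copied + generated


# Hypothetical distributions at one decoding step.
p_switch_copy = 0.6                                   # probability of copying
p_copy = {"50": 0.5, "16": 0.5}                       # over record values only
p_gen = {"scored": 0.5, "points": 0.4, "50": 0.1}     # over the output vocabulary

prob_50 = word_probability("50", p_switch_copy, p_copy, p_gen)
prob_scored = word_probability("scored", p_switch_copy, p_copy, p_gen)
```

Because the copy distribution is normalized separately, record values compete only with one another inside the copy branch, and the switch decides how much total mass that branch receives.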
Models with copy-decoders may be trained to minimize the negative log marginal probability, marginalizing out the latent variable $z_t$ (Gu et al., 2016; Yang et al., 2016; Merity et al., 2016). However, if it is known which target words $y_t$ are copied, it is possible to train with a loss that does not marginalize out the latent $z_t$. Gülçehre et al. (2016), for instance, assume that any target word $y_t$ that also appears in the source is copied, and train to minimize the negative joint log-likelihood of the $y_t$ and $z_t$.
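In applying such a loss in our case, we again note that there may be multiple records $r$ such that $r.m$ appears in $\hat{y}_{1:T}$. Accordingly, we slightly modify the $p_{\text{copy}}$ portion of the loss of Gülçehre et al. (2016) to sum over all matched records. In particular, we model the probability of relations $r \in \mathbf{s}$ such that $r.m = y_t$ and $r.e$ is in the same sentence as $r.m$. Letting $r(y_t) = \{r \in \mathbf{s} : r.m = y_t, \text{same-sentence}(r.e, r.m)\}$, we have:

$$p_{\text{copy}}(y_t \mid z_t, y_{1:t-1}, \mathbf{s}) = \sum_{r \in r(y_t)} p(r \mid z_t, y_{1:t-1}, \mathbf{s}).$$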
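The summation over matched records can be sketched as follows. This is an illustrative reading of the matching rule, assuming that "same sentence" can be checked by a simple substring test on the entity name; the record probabilities are invented, and the helper names are not from the authors' code.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    entity: str
    value: str
    type: str


def matched_records(word: str, sentence: str, records: List[Record]) -> List[Record]:
    """Records whose value equals the word and whose entity occurs in the sentence."""
    return [r for r in records if r.value == word and r.entity in sentence]


def copy_probability(
    word: str,
    sentence: str,
    records: List[Record],
    record_probs: Dict[Record, float],
) -> float:
    """Sum the copy distribution's mass over every record matching the word."""
    return sum(record_probs[r] for r in matched_records(word, sentence, records))


records = [
    Record("Russell Westbrook", "10", "rebounds"),
    Record("Russell Westbrook", "10", "assists"),
    Record("Steven Adams", "10", "points"),
]
sentence = "Russell Westbrook added 10 rebounds and 10 assists ."
# Hypothetical attention-derived distribution over the three records.
record_probs = {records[0]: 0.3, records[1]: 0.25, records[2]: 0.45}

p_copy_10 = copy_probability("10", sentence, records, record_probs)
```

The third record shares the value "10" but its entity is not mentioned in the sentence, so it does not contribute to the copied word's probability.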
We note here that the key distinction for our purposes between the Joint Copy model and the Conditional Copy model is that the latter conditions on whether there is a copy or not, and so in $p_{\text{copy}}$ the source records compete only with each other. In the Joint Copy model, however, the source records also compete with words that cannot be copied. As a result, training the Conditional Copy model with the supervised loss of Gülçehre et al. (2016) can be seen as training with a word-level reconstruction loss, where the decoder is trained to choose the record in $\mathbf{s}$ that gives rise to $y_t$.
Reconstruction-based techniques can also be applied at the document- or sentence-level during training. One simple approach to this problem is to utilize the hidden states of the decoder to try to reconstruct the database. A fully differentiable approach using the decoder hidden states has recently been successfully applied to neural machine translation by Tu et al. (2017). Unlike copying, this method is applied only at training, and attempts to learn decoder hidden states with broader coverage of the input data.
In adopting this reconstruction approach we segment the decoder hidden states $\mathbf{h}_t$ into $\lceil T/B \rceil$ contiguous blocks of size at most $B$. Denoting a single one of these hidden state blocks as $\mathbf{b}_i$, we attempt to predict each field value in some record $r \in \mathbf{s}$ from $\mathbf{b}_i$. We define $p(r.e, r.m \mid \mathbf{b}_i)$, the probability of the entity and value in record $r$ given $\mathbf{b}_i$, to be $\text{softmax}(f(\mathbf{b}_i))$, where $f$ is a parameterized function of $\mathbf{b}_i$, which in our experiments utilizes a convolutional layer followed by an MLP; full details are given in the Appendix. We further extend this idea and predict $K$ records in $\mathbf{s}$ from $\mathbf{b}_i$, rather than one. We can train with the following reconstruction loss for a particular $\mathbf{b}_i$:

$$\mathcal{L}(\theta) = \sum_{k=1}^{K} \min_{r \in \mathbf{s}} \left[ -\log p_k(r \mid \mathbf{b}_i; \theta) \right] = \sum_{k=1}^{K} \min_{r \in \mathbf{s}} \left[ -\sum_{x \in \{e, m, t\}} \log p_k(r.x \mid \mathbf{b}_i; \theta) \right],$$

where $p_k$ is the $k$’th predicted distribution over records, and where we have modeled each component of $r$ independently. This loss attempts to make the most probable record in $\mathbf{s}$ given $\mathbf{b}_i$ more probable. We found that augmenting the above loss with a term that penalizes the total variation distance (TVD) between the $p_k$ to be helpful. (Footnote 4: Penalizing the TVD between the $p_k$ might be useful if, for instance, $K$ is too large, and only a smaller number of records can be predicted from $\mathbf{b}_i$. We also experimented with encouraging, rather than penalizing, the TVD between the $p_k$, which might make sense if we were worried about ensuring the $p_k$ captured different records.) Both $\mathcal{L}(\theta)$ and the TVD term are simply added to the standard negative log-likelihood objective at training time.
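A small numerical sketch may help fix the idea of this loss. For each of several predicted distributions, the code below finds the record with the highest log-probability (treating entity, value, and type as independent) and accumulates its negative log-probability; a helper for the total variation distance between two categorical distributions is included as well. The function names, the field keys `"e"`, `"m"`, `"t"`, and all probabilities are assumptions for illustration, and how the TVD term is weighted in training is not shown.

```python
import math
from typing import Dict, List, Tuple

Rec = Tuple[str, str, str]                  # (entity, value, type)
FieldDists = Dict[str, Dict[str, float]]    # field key -> distribution over that field


def record_log_prob(rec: Rec, dists: FieldDists) -> float:
    """Log-probability of a record with its three fields modeled independently."""
    entity, value, rtype = rec
    return (math.log(dists["e"][entity])
            + math.log(dists["m"][value])
            + math.log(dists["t"][rtype]))


def reconstruction_loss(records: List[Rec], predictions: List[FieldDists]) -> float:
    """For each prediction, the negative log-probability of its best-scoring record."""
    return -sum(max(record_log_prob(r, d) for r in records) for d in predictions)


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


records = [("Westbrook", "50", "points"), ("Adams", "10", "rebounds")]
uniform_m = {"50": 0.5, "10": 0.5}
uniform_t = {"points": 0.5, "rebounds": 0.5}
# Two hypothetical predicted distributions (K = 2) for one block of hidden states.
predictions = [
    {"e": {"Westbrook": 0.8, "Adams": 0.2}, "m": uniform_m, "t": uniform_t},
    {"e": {"Westbrook": 0.2, "Adams": 0.8}, "m": uniform_m, "t": uniform_t},
]

loss = reconstruction_loss(records, predictions)
tvd_entities = total_variation(predictions[0]["e"], predictions[1]["e"])
```

Here each prediction favors a different record, so the two entity distributions are far apart in total variation.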
In this section we highlight a few important details of our models and methods; full details are in the Appendix. For our RotoWire models, the record encoder produces fixed-size vectors $\tilde{\mathbf{r}}_j$, and we use a 2-layer LSTM decoder with hidden states of the same size as the $\tilde{\mathbf{r}}_j$, and dot-product attention and input-feeding in the style of Luong et al. (2015). Unlike past work, we use two identically structured attention layers, one to compute the standard generation probabilities ($\text{gen}$ or $p_{\text{gen}}$, for the Joint Copy and Conditional Copy models, respectively), and one to produce the scores used in $\text{copy}$ or $p_{\text{copy}}$.
We train the generation models using SGD and truncated BPTT (Elman, 1990; Mikolov et al., 2010), as in language modeling. That is, we split each $y_{1:T}$ into contiguous blocks of length 100, and backprop both the gradients with respect to the current block as well as with respect to the encoder parameters for each block.
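The segmentation step of truncated BPTT can be sketched as below. This shows only how a target sequence is cut into contiguous blocks; the gradient computation itself (one backward pass per block, with decoder state carried forward but not differentiated through) is left out. The function name `split_into_blocks` is an assumption.

```python
from typing import List, Sequence


def split_into_blocks(tokens: Sequence[str], block_len: int = 100) -> List[List[str]]:
    """Cut a target sequence into contiguous blocks of at most block_len tokens."""
    return [list(tokens[i:i + block_len]) for i in range(0, len(tokens), block_len)]


# A dummy 250-token summary yields two full blocks and one partial block.
summary = [f"w{i}" for i in range(250)]
blocks = split_into_blocks(summary)
block_lengths = [len(b) for b in blocks]
```

Each block would then be treated as one training step, with the encoder re-run (or its gradients accumulated) for every block of the same summary.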
Our extractive evaluator consists of an ensemble of 3 single-layer convolutional and 3 single-layer bidirectional LSTM models. The convolutional models concatenate convolutions with kernel widths 2, 3, and 5, and 200 feature maps in the style of Kim (2014). Both types of model are trained with SGD.
In addition to neural baselines, we also use a problem-specific, template-based generator. The template-based generator first emits a sentence about the teams playing in the game, using a templatized sentence taken from the training set:
The <team1> (<wins1>-<losses1>) defeated the <team2> (<wins2>-<losses2>) <pts1>-<pts2>.
Then, 6 player-specific sentences of the following form are emitted (again adapting a simple sentence from the training set):
<player> scored <pts> points (<fgm>-<fga> FG, <tpm>-<tpa> 3PT, <ftm>-<fta> FT) to go with <reb> rebounds.
The 6 highest-scoring players in the game are used to fill in the above template. Finally, a typical end sentence is emitted:
The <team1>’ next game will be at home against the Dallas Mavericks, while the <team2> will travel to play the Bulls.
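The first two template types above can be filled with ordinary string formatting, as in the following sketch. The dictionary keys mirror the placeholder names in the templates, but the data layout, the function name `generate_summary`, and the example statistics are assumptions; the closing next-game sentence is omitted because it needs schedule information not modeled here.

```python
from typing import Dict, List

TEAM_TEMPLATE = ("The {team1} ({wins1}-{losses1}) defeated the {team2} "
                 "({wins2}-{losses2}) {pts1}-{pts2}.")
PLAYER_TEMPLATE = ("{player} scored {pts} points ({fgm}-{fga} FG, {tpm}-{tpa} 3PT, "
                   "{ftm}-{fta} FT) to go with {reb} rebounds.")


def generate_summary(
    line_score: Dict[str, object],
    box_score: List[Dict[str, object]],
    n_players: int = 6,
) -> str:
    """Emit the team sentence, then one sentence per top-scoring player."""
    sentences = [TEAM_TEMPLATE.format(**line_score)]
    top = sorted(box_score, key=lambda p: p["pts"], reverse=True)[:n_players]
    sentences.extend(PLAYER_TEMPLATE.format(**p) for p in top)
    return " ".join(sentences)


# Invented statistics for illustration.
line_score = dict(team1="Hawks", wins1=1, losses1=0,
                  team2="Heat", wins2=0, losses2=1, pts1=103, pts2=95)
box_score = [
    dict(player="Player A", pts=12, fgm=5, fga=11, tpm=1, tpa=3, ftm=1, fta=2, reb=4),
    dict(player="Player B", pts=27, fgm=10, fga=18, tpm=3, tpa=6, ftm=4, fta=4, reb=9),
]

summary = generate_summary(line_score, box_score)
```

Sorting by points before truncating ensures the highest scorers are described first, which matches the selection rule stated for the template baseline.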
We found that all models performed quite poorly on the SBNation data, with the best model achieving a validation perplexity of 33.34 and a BLEU score of 1.78. This poor performance is presumably attributable to the noisy quality of the SBNation data, and the fact that many documents in the dataset focus on information not in the box- and line-scores. Accordingly, we focus on RotoWire in what follows.
The main results for the RotoWire dataset are shown in Table 2, which shows the performance of the models in Section 4 in terms of the metrics defined in Section 3.2, as well as in terms of perplexity and BLEU.
|Model||RG P%||RG #||CS P%||CS R%||CO DLD%||PPL||BLEU|
|Joint Copy + Rec||57.81||8.31||23.65||23.30||9.02||7.25||10.00|
|Joint Copy + Rec + TVD||60.69||8.95||23.63||24.10||8.84||7.22||12.78|
|Joint Copy + Rec||62.11||10.90||21.36||26.26||9.07||7.25||10.85|
|Joint Copy + Rec + TVD||57.51||11.41||18.28||25.27||8.05||7.22||12.04|
|Joint Copy + Rec (B=5)||61.23||11.02||21.56||26.45||9.06||7.47||10.88|
|Joint Copy + Rec + TVD (B=1)||60.27||9.18||23.11||23.69||8.48||7.42||12.96|
|Conditional Copy (B=5)||71.82||12.82||22.17||27.16||8.68||7.67||14.49|
There are several interesting relationships in the development portion of Table 2. First we note that the Template model scores very poorly on BLEU, but does quite well on the extractive metrics, providing an upper-bound for how domain knowledge could help content selection and generation. All the neural models make significant improvements in terms of BLEU score, with the conditional copying with beam search performing the best, even though all the neural models achieve roughly the same perplexity.
The extractive metrics provide further insight into the behavior of the models. We first note that on the gold documents $y_{1:T}$ the extractive model reaches high precision. With the Joint Copy model, generation attains a much lower relation generation (RG) precision, indicating that relationships are often generated incorrectly. The best Conditional Copy system improves on this considerably, which is potentially the cause of its improved BLEU score, but it remains far below gold.
Notably, content selection (CS) and content ordering (CO) seem to have no correlation at all with BLEU. There is some improvement with CS for the conditional model or reconstruction loss, but not much change as we move to beam search. CO actually gets worse as beam search is utilized, possibly a side effect of generating more records (RG#). The fact that these scores are much worse than the simple templated model indicates that further research is needed into better copying alone for content selection and better long term content ordering models.
Test results are consistent with development results, indicating that the Conditional Copy model is most effective at BLEU, RG, and CS, and that reconstruction is quite helpful for improving the joint model.
We also undertook two human evaluation studies, using Amazon Mechanical Turk. The first study attempted to determine whether generations considered to be more precise by our metrics were also considered more precise by human raters. To accomplish this, raters were presented with a particular NBA game’s box score and line score, as well as with (randomly selected) sentences from summaries generated by our different models for those games. Raters were then asked to count how many facts in each sentence were supported by records in the box or line scores, and how many were contradicted. We randomly selected 20 distinct games to present to raters, and a total of 20 generated sentences per game were evaluated by raters. The left two columns of Table 3 contain the average numbers of supporting and contradicting facts per sentence as determined by the raters, for each model. We see that these results are generally in line with the RG and CS metrics, with the Conditional Copy model having the highest number of supporting facts, and the reconstruction terms significantly improving the Joint Copy models.
Using a Tukey HSD post-hoc analysis of an ANOVA with the number of contradicting facts as the dependent variable and the generating model and rater id as independent variables, we found significant pairwise differences in contradictory facts between the gold generations and all models except “Copy+Rec+TVD,” as well as a significant difference between “Copy+Rec+TVD” and “Copy”. We similarly found a significant pairwise difference between “Copy+Rec+TVD” and “Copy” for number of supporting facts.
Our second study attempted to determine whether generated summaries differed in terms of how natural their ordering of records (as captured, for instance, by the DLD metric) is. To test this, we presented raters with random summaries generated by our models and asked them to rate the naturalness of the ordering of facts in the summaries on a 1-7 Likert scale. 30 random summaries were used in this experiment, each rated 3 times by distinct raters. The average Likert ratings are shown in the rightmost column of Table 3. While it is encouraging that the gold summaries received a higher average score than the generated summaries (and that the reconstruction term again improved the Joint Copy model), a Tukey HSD analysis similar to the one presented above revealed no significant pairwise differences.
|Model||# Supp.||# Cont.||Order Rat.|
|Joint Copy + Rec||2.33||1.83||4.43|
|Joint Copy + Rec + TVD||2.43||1.16||4.18|
Figure 2 shows a document generated by the Conditional Copy model, using a beam of size 5. This particular generation evidently has several nice properties: it nicely learns the colloquial style of the text, correctly using idioms such as “19 percent from deep.” It is also partially accurate in its use of the records; we highlight in blue when it generates text that is licensed by a record in the associated box- and line-scores.
At the same time, the generation also contains major logical errors. First, there are basic copying mistakes, such as flipping the teams’ win/loss records. The system also makes obvious semantic errors; for instance, it generates the phrase “the Rockets were able to out-rebound the Rockets.” Finally, we see the model hallucinates factual statements, such as “in front of their home crowd,” which is presumably likely according to the language model, but ultimately incorrect (and not supported by anything in the box- or line- scores). In practice, our proposed extractive evaluation will pick up on many errors in this passage. For instance, “four assists” is an RG error, repeating the Rockets’ rebounds could manifest in a lower CO score, and incorrectly indicating the win/loss records is a CS error.
In this section we note additional related work not noted throughout. Natural language generation has been studied for decades (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997), and generating summaries of sports games has been a topic of interest for almost as long (Robin, 1994; Tanaka-Ishii et al., 1998; Barzilay and Lapata, 2005).
Historically, research has focused on both content selection (“what to say”) (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997; Duboue and McKeown, 2003; Barzilay and Lapata, 2005), and surface realization (“how to say it”) (Goldberg et al., 1994; Reiter et al., 2005) with earlier work using (hand-built) grammars, and later work using SMT-like approaches (Wong and Mooney, 2007) or generating from PCFGs (Belz, 2008) or other formalisms (Soricut and Marcu, 2006; White et al., 2007). In the late 2000s and early 2010s, a number of systems were proposed that did both (Liang et al., 2009; Angeli et al., 2010; Kim and Mooney, 2010; Lu and Ng, 2011; Konstas and Lapata, 2013).
Within the world of neural text generation, some recent work has focused on conditioning language models on tables (Yang et al., 2016), and generating short biographies from Wikipedia Tables (Lebret et al., 2016; Chisholm et al., 2017). Mei et al. (2016) use a neural encoder-decoder approach on standard record-based generation datasets, obtaining impressive results, and motivating the need for more challenging NLG problems.
This work explores the challenges facing neural data-to-document generation by introducing a new dataset, and proposing various metrics for automatically evaluating content selection, generation, and ordering. We see that recent ideas in copying and reconstruction lead to improvements on this task, but that there is a significant gap even between these neural models and templated systems. We hope to motivate researchers to focus further on generation problems that are relevant both to content selection and surface realization, but may not be reflected clearly in the model’s perplexity.
Future work on this task might include approaches that process or attend to the source records in a more sophisticated way, generation models that attempt to incorporate semantic or reference-related constraints, and approaches to conditioning on facts or records that are not as explicit in the box- and line-scores.
We gratefully acknowledge the support of a Google Research Award.
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.
Proceedings of the 25th international conference on Machine learning, pages 128–135. ACM.
Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602–610.
The RotoWire data covers NBA games played between 1/1/2014 and 3/29/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 3398, 727, and 728 summaries, respectively.
The SBNation data covers NBA games played between 11/3/2006 and 3/26/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 7633, 1635, and 1635 summaries, respectively.
All numbers in the box- and line-scores (but not the summaries) are converted to integers; fractional numbers corresponding to percents are multiplied by 100 to obtain integers in [0, 100]. We show the types of records in the data in Table 4.
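The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `to_int_record`, the `is_percent` flag, and the choice to round to the nearest integer are all assumptions.

```python
def to_int_record(value, is_percent=False):
    """Convert a box- or line-score value to an integer.

    Fractional percentages (e.g. 0.47) are scaled by 100 first.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    number = float(value)
    if is_percent:
        number *= 100  # map a fraction in [0, 1] onto [0, 100]
    return int(round(number))


if __name__ == "__main__":
    print(to_int_record("12"))                    # a plain count
    print(to_int_record("0.47", is_percent=True)) # a shooting percentage
```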
For the RotoWire data, each relation (record) is encoded by embedding its entity, type, and value, together with a “home-or-away” indicator feature, in a common vector space, and applying a 1-layer MLP (with ReLU nonlinearity) to map the concatenation of these embeddings back into that space. To initialize the decoder LSTMs, we first mean-pool over the encoded relations by entity (giving one vector per entity), and then linearly transform the concatenation of these pooled entity-representations so that they can initialize the cells and hidden states of a 2-layer LSTM whose states have the same dimensionality. The SBNation setup is identical, except for the dimensionality of the vectors.
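The per-entity mean-pooling step used to initialize the decoder can be sketched in plain Python. The helper names, the list-of-floats vector representation, and the fixed entity ordering are assumptions made for illustration; the subsequent learned linear transformation is omitted.

```python
from collections import defaultdict


def mean_pool_by_entity(encoded_records):
    """Average encoded record vectors per entity.

    encoded_records: list of (entity_name, vector) pairs, where each
    vector is a list of floats of equal length.
    Returns a dict mapping entity_name -> mean vector.
    """
    grouped = defaultdict(list)
    for entity, vector in encoded_records:
        grouped[entity].append(vector)
    pooled = {}
    for entity, vectors in grouped.items():
        # zip(*vectors) iterates over dimensions; average each one.
        pooled[entity] = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return pooled


def concat_pooled(pooled, entity_order):
    """Concatenate pooled entity vectors in a fixed (assumed) order."""
    flat = []
    for entity in entity_order:
        flat.extend(pooled[entity])
    return flat
```

In a full model, the concatenated vector would then pass through a linear layer to produce the initial cell and hidden states.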
As mentioned in the body of the paper, we compute two different attention distributions (i.e., using different parameters) at each decoding step. For the Joint Copy model, one of these sets of attention scores is not normalized on its own; instead, it is normalized jointly with the scores for all output words.
Within the Conditional Copy model, we compute the copy-switch probability by mean-pooling the encoded records, concatenating the result with the current (topmost) hidden state of the LSTM, and then feeding this concatenation through a 1-layer ReLU MLP with hidden dimension 600 and a sigmoid output layer.
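The difference between the two copy variants lies in how the scores are normalized, which the following sketch tries to make concrete. It is a simplified illustration under assumptions: the function names are hypothetical, scores are plain Python lists, and the switch probability `p_copy` is passed in directly rather than computed by an MLP.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_copy(gen_scores, copy_scores):
    """Normalize generation and copy scores together in one softmax."""
    probs = softmax(list(gen_scores) + list(copy_scores))
    split = len(gen_scores)
    return probs[:split], probs[split:]


def conditional_copy(gen_scores, copy_scores, p_copy):
    """Normalize each score set separately, then weight by the switch."""
    gen = [(1.0 - p_copy) * p for p in softmax(gen_scores)]
    copy = [p_copy * p for p in softmax(copy_scores)]
    return gen, copy
```

In both cases the generation and copy probabilities together sum to one; in the conditional variant the total copy mass equals `p_copy` by construction.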
For the reconstruction loss, we feed blocks (of size at most 100) of the decoder’s LSTM hidden states through a convolutional model in the style of Kim (2014). We use kernels of width 3 and 5, 200 filters for each width, a ReLU nonlinearity, and max-over-time pooling. To create the reconstruction distributions, these now 400-dimensional features are then mapped via an MLP with a ReLU nonlinearity into 3 separate 200-dimensional vectors corresponding to the predicted relation’s entity, value, and type, respectively. These 200-dimensional vectors are then fed through (separate) linear decoders and softmax layers in order to obtain distributions over entities, values, and types. We use multiple distinct distributions of this kind.
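The blocking of decoder states can be sketched as below. The text only gives the maximum block size, so contiguous, non-overlapping blocks are an assumption here, and `split_into_blocks` is a hypothetical helper name.

```python
def split_into_blocks(hidden_states, max_block=100):
    """Split a sequence of decoder hidden states into contiguous blocks.

    Each block has at most `max_block` states; the final block may be
    shorter. Non-overlapping blocks are an assumption of this sketch.
    """
    if max_block < 1:
        raise ValueError("max_block must be positive")
    return [
        hidden_states[start:start + max_block]
        for start in range(0, len(hidden_states), max_block)
    ]
```

Each resulting block would then be passed through the convolutional model independently.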
Models are trained with SGD, a learning rate of 1 (which is divided by 2 every time validation perplexity fails to decrease), and a batch size of 16. We use dropout (at a rate of 0.5) between LSTM layers and before the linear decoder.
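The learning-rate schedule can be sketched as follows. One assumption is made explicit: "fails to decrease" is interpreted relative to the best validation perplexity seen so far, although comparing against the previous epoch would also be consistent with the text. The function name is hypothetical.

```python
def update_learning_rate(lr, val_ppl, best_ppl):
    """Halve the learning rate when validation perplexity does not improve.

    Returns the (possibly halved) learning rate and the updated best
    perplexity. Comparison against the best value so far is an assumption.
    """
    if val_ppl < best_ppl:
        return lr, val_ppl
    return lr / 2.0, best_ppl


def run_schedule(val_ppls, initial_lr=1.0):
    """Apply the schedule to a sequence of per-epoch validation perplexities."""
    lr, best = initial_lr, float("inf")
    for ppl in val_ppls:
        lr, best = update_learning_rate(lr, ppl, best)
    return lr
```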
To form an information extraction dataset, we first sentence-tokenize the gold summary documents using NLTK (Bird, 2006). We then determine which word-spans could represent entities (by matching against players, teams, or cities in the database), and which word-spans could represent numbers (using the open source text2num library, https://github.com/exogen/text2num, to convert (strings of) number-words into numbers). We ignore certain particularly misleading number-words, such as “three-point,” where we should not expect a corresponding value of 3 among the records. We then consider each entity-number pair in the same sentence, and if there is a record in the database whose entity matches the entity span and whose value matches the number, we annotate the pair with that record’s type; otherwise, we give it a null label.
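The labeling rule for entity-number pairs can be sketched as below. The record layout `(entity, record_type, value)`, the `"NONE"` stand-in for the null label, and the decision to keep every matching type when several records share an entity and value are assumptions of this illustration.

```python
def label_pairs(entities, numbers, records, null_label="NONE"):
    """Label each (entity, number) pair from one sentence.

    records: list of (entity, record_type, value) tuples from the database.
    A pair receives the types of all records whose entity and value match;
    pairs with no match receive the null label.
    """
    lookup = {}
    for entity, record_type, value in records:
        lookup.setdefault((entity, value), []).append(record_type)
    labeled = []
    for entity in entities:
        for number in numbers:
            types = lookup.get((entity, number), [null_label])
            labeled.append((entity, number, types))
    return labeled
```

For example, with records `[("LeBron James", "PTS", 25)]`, the pair `("LeBron James", 25)` is labeled `["PTS"]`, while `("LeBron James", 3)` gets the null label.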
We predict relations by ensembling 3 convolutional models and 3 bidirectional LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) models. Each model consumes the words in the sentence, which are embedded as vectors, as well as the distances of each word in the sentence from both the entity word-span and the number word-span (as described above), each of which is also embedded as a vector. These vectors are concatenated and fed into either a convolutional model or a bidirectional LSTM model.
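The distance features can be sketched as follows. The signed convention (negative before the span, zero inside it, positive after it) and the half-open span indexing are assumptions; the text only states that distances from each span are used.

```python
def span_distances(num_tokens, span_start, span_end):
    """Signed distance of each token position to a word-span.

    The span covers token indices [span_start, span_end). Tokens inside
    the span get 0, tokens before it get negative distances, and tokens
    after it get positive distances.
    """
    distances = []
    for i in range(num_tokens):
        if i < span_start:
            distances.append(i - span_start)
        elif i >= span_end:
            distances.append(i - span_end + 1)
        else:
            distances.append(0)
    return distances
```

Calling this once for the entity span and once for the number span gives the two distance sequences that would be embedded alongside the words.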
The convolutional model uses 600 total filters, with 200 filters for each of the kernel widths 2, 3, and 5, a ReLU nonlinearity, and max-pooling. These features are then mapped via a 1-layer (ReLU) MLP into a hidden representation, from which a linear decoder layer and softmax predict one of the 39 relation types (or the null label).
The bidirectional LSTM model uses a single layer with 500 units in each direction, whose outputs are concatenated. The hidden states are max-pooled, and then mapped via a 1-layer (ReLU) MLP into a hidden representation, from which a linear decoder layer and softmax predict one of the 39 relation types (or the null label).
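Both classifier families reduce a variable-length sequence of feature vectors to a single fixed-size vector by max-pooling over positions. A minimal sketch, assuming feature vectors are plain lists of equal length and using a hypothetical helper name:

```python
def max_over_time(feature_vectors):
    """Max-pool a sequence of equal-length vectors over the time axis.

    feature_vectors: non-empty list of lists (one vector per position).
    Returns one vector holding the per-dimension maximum.
    """
    if not feature_vectors:
        raise ValueError("need at least one feature vector")
    return [max(dimension) for dimension in zip(*feature_vectors)]
```

The pooled vector has the same size regardless of sentence length, which is what allows it to be fed to the fixed-size MLP.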