Advances in sensor and data storage technologies have rapidly increased the amount of data produced in various fields such as weather, finance, and sports. To address the information overload caused by such massive data, data-to-text generation technology, which expresses the contents of data in natural language, is becoming more important Barzilay and Lapata (2005). Recently, neural methods have been able to generate high-quality short summaries, especially from small pieces of data Liu et al. (2018).
Despite this success, it remains challenging to generate a high-quality long summary from data Wiseman et al. (2017). One reason for the difficulty is that the input data is too large for a naive model to find its salient part, i.e., to determine which part of the data should be mentioned. In addition, the salient part moves as the summary explains the data. For example, when generating a summary of a basketball game (Table 1 (b)) from the box score (Table 1 (a)), the input contains numerous data records about the game, e.g., Jordan Clarkson scored 18 points. Existing models often refer to the same data record multiple times Puduppully et al. (2019). The models may also mention an incorrect data record, e.g., Kawhi Leonard added 19 points, when the summary should mention LaMarcus Aldridge, who scored 19 points. Thus, we need a model that finds salient parts, tracks transitions between salient parts, and expresses information faithful to the input.
In this paper, we propose a novel data-to-text generation model with two modules, one for saliency tracking and another for text generation. The tracking module keeps track of saliency in the input data: when it detects a saliency transition, it selects a new data record (we use 'data record' and 'relation' interchangeably) and updates its state. The text generation module generates a document conditioned on the current tracking state. Our model can be seen as imitating the human writing process of gradually selecting and tracking the data while generating the summary. In addition, we note some writer-specific patterns and characteristics: how data records are selected to be mentioned, and how data records are expressed as text, e.g., the order of data records and the word usage. We therefore also incorporate writer information into our model.
The experimental results demonstrate that, even without writer information, our model achieves the best performance among the previous models in all evaluation metrics: 94.38% precision of relation generation, 42.40% F1 score of content selection, 19.38% normalized Damerau-Levenshtein Distance (DLD) of content ordering, and 16.15% of BLEU score. We also confirm that adding writer information further improves the performance.
Box score: the top table shows the number of wins and losses and a summary of each game. The bottom table shows the statistics of each player, such as points scored (Player's Pts) and total rebounds (Player's Reb).
2 Related Work
2.1 Data-to-Text Generation
Data-to-text generation is a task for generating descriptions from structured or non-structured data including sports commentary Tanaka-Ishii et al. (1998); Chen and Mooney (2008); Taniguchi et al. (2019), weather forecast Liang et al. (2009); Mei et al. (2016), biographical text from infobox in Wikipedia Lebret et al. (2016); Sha et al. (2018); Liu et al. (2018) and market comments from stock prices Murakami et al. (2017); Aoki et al. (2018).
Neural generation methods have become the mainstream approach for data-to-text generation. The encoder-decoder framework Cho et al. (2014); Sutskever et al. (2014) with the attention Bahdanau et al. (2015); Luong et al. (2015) and copy mechanisms Gu et al. (2016); Gulcehre et al. (2016) has been successfully applied to data-to-text tasks. However, neural generation methods sometimes yield fluent but inadequate descriptions Tu et al. (2017). In data-to-text generation, descriptions inconsistent with the input data are problematic.
Recently, Wiseman et al. (2017) introduced the RotoWire dataset, which contains multi-sentence summaries of basketball games with box-score (Table 1). This dataset requires the selection of a salient subset of data records for generating descriptions. They also proposed automatic evaluation metrics for measuring the informativeness of generated summaries.
Puduppully et al. (2019) proposed a two-stage method that first predicts the sequence of data records to be mentioned and then generates a summary conditioned on the predicted sequence. Their idea is similar to ours in that both consider a sequence of data records as content planning. However, our proposal differs from theirs in that ours uses a recurrent neural network for saliency tracking, and our decoder dynamically chooses a data record to be mentioned without fixing a sequence of data records in advance.
2.2 Memory modules
The memory network can be used to maintain and update representations of the salient information Weston et al. (2015); Sukhbaatar et al. (2015); Graves et al. (2016). This module is often used in natural language understanding to keep track of the entity state Kobayashi et al. (2016); Hoang et al. (2018); Bosselut et al. (2018).
Recently, entity tracking has become popular for generating coherent text Kiddon et al. (2016); Ji et al. (2017); Yang et al. (2017); Clark et al. (2018). Kiddon et al. (2016) proposed a neural checklist model that updates predefined item states. Ji et al. (2017) proposed an entity representation for language modeling; their method updates the entity tracking state when an entity is introduced and selects the salient entity state.
Our model extends this entity tracking module to data-to-text generation. The entity tracking module selects the salient entity and the appropriate attribute at each time step, updates their states, and generates coherent summaries from the selected data records.
[Table 2 columns: First Name | Last Name | Player Pts | Player Reb | Player Ast]
Table 2: a running example of our model. At each time step t, the model predicts each random variable. The model first determines whether to refer to a data record (z_t = 1) or not (z_t = 0). If z_t = 1, the model selects an entity e_t, its attribute a_t, and, if needed, a binary variable n_t for the number expression. For example, at one step the model predicts z_t = 1 and then selects the entity Jabari Parker and its attribute Player Pts. Given these values, the model outputs a token from the selected data record.
3 Dataset

Through careful examination, we found that in the original RotoWire dataset, some NBA games have two documents, one of which is sometimes in the training data while the other is in the test or validation data. Such documents are similar to each other, though not identical. To make the dataset more reliable for experiments, we created a new version.
We ran the script provided by Wiseman et al. (2017), which crawls the RotoWire website for NBA game summaries. The script collected approximately 78% of the documents in the original dataset; the remaining documents were no longer available. We also collected the box scores associated with the collected documents. We observed that some of the box scores had been modified compared with the original RotoWire dataset.
The collected dataset contains 3,752 instances (i.e., pairs of a document and box-scores). However, the four shortest documents were not summaries; they were, for example, announcements about the postponement of a match. We thus deleted these 4 instances and were left with 3,748 instances. We followed the dataset split of Wiseman et al. (2017) to split our dataset into training, development, and test data. We found 14 instances that did not have corresponding instances in the original data and randomly assigned 9, 2, and 3 of them to the training, development, and test data, respectively. Finally, the sizes of our training, development, and test datasets are 2,714, 534, and 500, respectively. On average, each summary has 384 tokens and is associated with 644 data records. As far as we checked, each match has only one summary in our dataset. We also collected the writer of each document. Our dataset contains 32 different writers; the most prolific writer wrote 607 documents, some writers wrote fewer than ten documents, and the average is 117 documents per writer. We call our new dataset RotoWire-Modified. (For information about the dataset, please follow this link: https://github.com/aistairc/rotowire-modified)
4 Saliency-Aware Text Generation
At the core of our model is a neural language model with a memory state h_t that generates a summary given a set of data records x. Our model has another memory state s_t, which is used to remember the data records that have been referred to. s_t is also used to update h_t, meaning that the referred data records affect text generation.
Our model decides whether to refer to a data record, which data record to mention, and how to express a number. The selected data record is used to update the tracking memory state. Formally, we use the following four variables:
z_t: a binary variable that determines whether the model refers to the input at time step t (z_t = 1) or not (z_t = 0).

e_t: the salient entity at time step t (e.g., Hawks, LeBron James).

a_t: the salient attribute to be mentioned at time step t (e.g., Pts).

n_t: if attribute a_t of the salient entity is numeric, this binary variable determines whether the value in the data records is output in Arabic numerals (e.g., 50) or in English words (e.g., fifty).
To keep track of the salient entity, our model predicts these random variables at each time step throughout its summary generation process. A running example of our model is shown in Table 2, and the full algorithm is described in Appendix A. In the following subsections, we explain how to initialize the model, predict these random variables, and generate a summary. Due to space limitations, bias vectors are omitted.
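As a concrete illustration, one decoding step of this process can be sketched as follows; this is a simplified, hypothetical rendering with greedy decisions, and the scoring functions and names are our illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def decode_step(h_prev, entity_scores, attr_scores, w_z):
    """One decoding step (greedy): predict z_t, then e_t and a_t if z_t = 1.

    h_prev        : previous language-model hidden state (vector)
    entity_scores : unnormalized scores over candidate entities
    attr_scores   : unnormalized scores over attributes
    w_z           : weight vector for the transition probability
    """
    p_refer = sigmoid(w_z @ h_prev)      # p(z_t = 1 | h_{t-1})
    z_t = int(p_refer > 0.5)
    if z_t == 0:
        # Keep the current salient record; only the next word is generated.
        return z_t, None, None
    e_t = int(np.argmax(softmax(entity_scores)))   # most probable entity
    a_t = int(np.argmax(softmax(attr_scores)))     # most probable attribute
    return z_t, e_t, a_t
```

In the full model, the scores over entities and attributes are of course computed from the tracking and language-model states rather than given as inputs.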
4.1 Notation and embeddings

Before explaining our method, we describe our notation. Let E and A denote the sets of entities and attributes, respectively. Each record r consists of an entity e, an attribute a, and its value v, and is therefore represented as r = (e, a, v). For example, each cell of the box score in Table 1 corresponds to one such record.
Let r̄ denote the embedding of data record r, and let ē denote the embedding of entity e. Note that ē depends on the set of data records, i.e., it depends on the game. We also use ẽ for the static embedding of entity e, which, in contrast, does not depend on the game.
Given the embeddings of entity e, attribute a, and value v, we use a concatenation layer to combine the information from these vectors and produce the embedding r̄ of each data record r:

r̄ = W_r [emb(e); emb(a); emb(v)]

where [·; ·] indicates the concatenation of vectors and W_r denotes a weight matrix. (We also concatenate an embedding vector that represents whether the entity is on the home or away team.)
We obtain ē, the embedding of entity e in the set of data records x, by summing all of its data-record embeddings, each transformed by an attribute-specific matrix:

ē = Σ_{r = (e, a, v) ∈ x} W_a r̄

where W_a is a weight matrix for attribute a. Since ē depends on the game as above, ē is supposed to represent how entity e played in the game.
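The two embedding steps above can be sketched in a few lines of numpy; the dimensions, attribute names, and random weights below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding size (illustrative)

def record_embedding(W_r, e_emb, a_emb, v_emb):
    """Concatenate the entity, attribute, and value embeddings and project:
    r_bar = W_r [emb(e); emb(a); emb(v)]."""
    return W_r @ np.concatenate([e_emb, a_emb, v_emb])

def dynamic_entity_embedding(records, W_by_attr):
    """Sum the entity's record embeddings, each transformed by an
    attribute-specific matrix: e_bar = sum over records of W_a r_bar."""
    return sum(W_by_attr[a] @ r_bar for a, r_bar in records)

W_r = rng.normal(size=(d, 3 * d))
W_by_attr = {"Pts": rng.normal(size=(d, d)), "Reb": rng.normal(size=(d, d))}
e_emb, a_emb, v_emb = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
r_bar = record_embedding(W_r, e_emb, a_emb, v_emb)
e_bar = dynamic_entity_embedding([("Pts", r_bar), ("Reb", r_bar)], W_by_attr)
```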
To initialize the hidden state of each module, we use the embedding of a special start-of-document (SoD) token for the language model and the averaged dynamic entity embeddings for the tracking module.
4.2 Saliency transition
Generally, the saliency of a text changes during generation. In our work, we suppose that the saliency is represented by the entity and attribute currently being talked about. We therefore propose a model that refers to a data record at each time step and transitions to another as the text proceeds.
To determine whether to transition to another data record at time step t, the model calculates the transition probability p(z_t = 1 | h_{t-1}) = σ(W_z h_{t-1}), where σ is the sigmoid function and W_z is a weight matrix. If this probability is high, the model transitions to another data record.
When the model decides to transition, it then determines which entity and attribute to refer to, and generates the next word (Section 4.3). Otherwise, the model generates the next word without updating the tracking states (Section 4.4).
4.3 Selection and tracking
When the model refers to a new data record, it selects an entity and its attribute. It also tracks the saliency by putting information about the selected entity and attribute into the tracking memory vector. The model updates the memory states only when the salient entity changes.
Specifically, the model first calculates the probability of selecting each entity, where E_{t-1} denotes the set of entities that have already been referred to by time step t, and t_e denotes the time step at which entity e was last mentioned.
The model selects the most probable entity as the next salient entity and adds it to the set of entities that have appeared.
If the salient entity changes, the model updates the hidden state of the tracking model with a recurrent neural network with gated recurrent units (GRU; Chung et al., 2014). Note that if the entity selected at time step t is identical to the previously selected entity, the hidden state of the tracking model is not updated.
If the selected entity is new (i.e., it has never appeared before), the hidden state of the tracking model is updated with the embedding of the entity as input. In contrast, if the entity has already appeared in the past but is not identical to the previous one, we use the memory state from when this entity last appeared as input, to fully exploit the local history of this entity.
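The three-way update rule described above can be sketched as a simple dispatch; the toy GRU stand-in below is an assumption purely for illustration (any recurrent cell taking a state and an input fits the same interface):

```python
def update_tracking_state(e_t, e_prev, seen, last_state, gru, s_prev, entity_emb):
    """Dispatch rule for the tracking state (sketch).

    - same entity as before   -> keep the previous state
    - brand-new entity        -> recurrent update on the entity embedding
    - seen-before entity      -> recurrent update on the state saved at
                                 that entity's last mention
    """
    if e_t == e_prev:
        return s_prev
    if e_t not in seen:
        return gru(s_prev, entity_emb[e_t])
    return gru(s_prev, last_state[e_t])

# Toy stand-in for a GRU cell: any function of (state, input) works here.
toy_gru = lambda s, x: 0.5 * s + 0.5 * x
```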
Given the updated hidden state of the tracking model, we next select the attribute of the salient entity according to its probability. After selecting the most probable attribute of the salient entity, the tracking model updates its memory state with the embedding of the selected data record introduced in Section 4.1.
4.4 Summary generation
Given the two hidden states, one from the language model and the other from the tracking model, the model generates the next word. We also incorporate a copy mechanism that copies the value of the salient data record.
If the model refers to a new data record, it directly copies the value of that data record. However, the values of numerical attributes can be expressed in at least two different manners: Arabic numerals (e.g., 14) and English words (e.g., fourteen). We decide which one to use with a probability computed by a linear layer over the hidden states followed by the sigmoid function. The model then updates the hidden state of the language model with the embedding of the selected data record.
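Once the numeral-versus-word decision has been made, rendering the value is straightforward; the toy lookup table and fallback behavior below are our assumptions for illustration:

```python
# Toy lookup covering only a few values; a real system would cover more.
NUMBER_WORDS = {3: "three", 11: "eleven", 14: "fourteen", 19: "nineteen"}

def render_value(value, use_words):
    """Render a numeric attribute value as an English word (use_words=True)
    when the lookup covers it, and as an Arabic numeral otherwise."""
    if use_words and value in NUMBER_WORDS:
        return NUMBER_WORDS[value]
    return str(value)
```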
If the salient data record is the same as the previous one, the model predicts the next word via a probability over words conditioned on the context vector.
Subsequently, the hidden state of the language model is updated with the embedding of the word generated at time step t. (In an initial experiment, we observed a word repetition problem when the tracking model was not updated while generating a sentence. To avoid this problem, we also update the tracking model with special trainable vectors that refresh its states after the model generates a period.)
4.5 Incorporating writer information
We also incorporate information about the writer of the summaries into our model. Specifically, instead of using Equation (9), we concatenate the embedding of the writer with the context vector to construct a new context vector. Since this new context vector is used for calculating the probability over words in Equation (10), the writer information directly affects word generation, which is regarded as surface realization in terms of traditional text generation. Simultaneously, the context vector enhanced with the writer information is used to obtain the hidden state of the language model, which is further used to select the salient entity and attribute, as described in Sections 4.2 and 4.3. Therefore, in our model, the writer information affects both surface realization and content planning.
4.6 Learning objective
We apply fully supervised training that maximizes the log-likelihood of all the random variables and words in the training data.
|Model||RG #||RG P%||CS P%||CS R%||CS F1%||CO DLD%||BLEU|
|Wiseman et al. (2017)||22.93||60.14||24.24||31.20||27.29||14.70||14.73|
|Puduppully et al. (2019)||33.06||83.17||33.06||43.59||37.60||16.97||13.96|
5.1 Experimental settings
We used RotoWire-Modified, the dataset described in Section 3, for our experiments. The training, development, and test data contained 2,714, 534, and 500 games, respectively.
Since we take a supervised training approach, we need annotations of the random variables (i.e., the transition, entity, attribute, and number-expression variables) in the training data, as shown in Table 2. Instead of simple lexical matching, which is prone to annotation errors, we use the information extraction system provided by Wiseman et al. (2017). Although this system is trained on noisy rule-based annotations, we conjecture that it is more robust to errors because it is trained to minimize a marginalized loss function over ambiguous relations. All training details are described in Appendix B.
5.2 Models to be compared
We compare our model (our code is available from https://github.com/aistairc/sports-reporter) against two baseline models. One is the model of Wiseman et al. (2017), which generates a summary with an attention-based encoder-decoder model. The other is the model proposed by Puduppully et al. (2019), which first predicts the sequence of data records and then generates a summary conditioned on the predicted sequence. Wiseman et al. (2017)'s model refers to all data records at every time step, while Puduppully et al. (2019)'s model refers to the subset of data records predicted in its first stage. Unlike these models, our model uses one memory vector that tracks the history of the referred data records during generation. We retrained the baselines on our new dataset. We also report the performance of the Gold and Templates summaries. The Gold summary is identical to the reference summary, and each Templates summary is generated in the same manner as in Wiseman et al. (2017).
In the latter half of our experiments, we examine the effect of adding information about writers. In addition to enhancing our model with writer information, we also add writer information to the model of Puduppully et al. (2019). Their method consists of two stages corresponding to content planning and surface realization; by incorporating writer information into each of the two stages, we can clearly see which part of the model the writer information contributes to. For the Puduppully et al. (2019) model, we attach the writer information in the following three ways:
concatenating writer embedding with the input vector for LSTM in the content planning decoder (stage 1);
concatenating writer embedding with the input vector for LSTM in the text generator (stage 2);
using both 1 and 2 above.
For more details about each decoding stage, readers can refer to Puduppully et al. (2019).
5.3 Evaluation metrics
We use the extractive evaluation metrics of Wiseman et al. (2017), i.e., relation generation (RG), content selection (CS), and content ordering (CO). These metrics measure how well the relations extracted from the generated summary match the correct relations. (The model for extracting relation tuples was trained on tuples made from the entities (e.g., team name, city name, player name) and attribute values (e.g., "Lakers", "92") extracted from the summaries, and the corresponding attributes (e.g., "Team Name", "Pts") found in the box- or line-score. The precision and recall of this extraction model are 93.4% and 75.0%, respectively, on the test data.) The metrics are:
RG: the ratio of correct relations among all extracted relations, where correct relations are those found in the input data records. The average number of extracted relations is also reported.
CS: precision and recall of the relations extracted from the generated summary against those from the reference summary.
CO: the edit distance between the sequences of relations extracted from the generated and reference summaries, measured with normalized Damerau-Levenshtein Distance (DLD).
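Under these definitions, CS and CO can be sketched as follows; we treat relations as hashable tuples, use the optimal-string-alignment variant of DLD, and report CO as one minus the normalized distance so that higher is better. The exact normalization and tie-breaking conventions of the official evaluation scripts may differ:

```python
def dld(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)
    between two sequences of relation tuples."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

def content_selection(pred, gold):
    """CS precision and recall over the sets of extracted relations."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    prec = tp / len(pred_set) if pred_set else 0.0
    rec = tp / len(gold_set) if gold_set else 0.0
    return prec, rec

def content_ordering(pred, gold):
    """CO as 1 - normalized DLD (higher is better)."""
    if not pred and not gold:
        return 1.0
    return 1.0 - dld(pred, gold) / max(len(pred), len(gold))
```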
6 Results and Discussions
We first focus on the quality of the tracking model and the entity representations in Sections 6.1 to 6.4, using the model without writer information. We examine the effect of writer information in Section 6.5.
6.1 Saliency tracking-based model
As shown in Table 3, our model outperforms all baselines across all evaluation metrics. (The scores of Puduppully et al. (2019)'s model dropped significantly from those they reported, especially on BLEU. We speculate this is mainly due to the reduced amount of training data (Section 3); that is, their model might be more data-hungry than the other models.) One noticeable result is that our model achieves slightly higher RG precision than the gold summary. Owing to the nature of extractive evaluation, a generated summary can beat the gold summary in relation-generation precision; in fact, the template model achieves 100% precision.
Another is that only our model exceeds the template model in F1 score of content selection, and it obtains the highest content-ordering performance. This implies that the tracking model encourages the selection of salient input records in the correct order.
6.2 Qualitative analysis of entity embedding
Our model has the dynamic entity embedding, which depends on the box score of each game, in addition to the static entity embedding. We now analyze the difference between these two types of embeddings.

We present two-dimensional visualizations of both embeddings produced using PCA Pearson (1901). As shown in Figure 1, which visualizes the static entity embeddings, the top-ranked players are located close to each other.
We also present visualizations of the dynamic entity embeddings in Figure 2. Although we did not carry out NBA-specific feature engineering (e.g., whether a player scored double digits; in the NBA, a player who reaches double digits in one of five categories (points, rebounds, assists, steals, and blocked shots) in a game is regarded as a good player, and reaching double digits in two of those categories is referred to as a double-double) for the dynamic entity embedding, the embeddings of the players who performed well in each game have similar representations. In addition, the embedding of the same player changes depending on the box score of each game. For instance, LeBron James recorded a double-double in the game on April 22, 2016; for this game, his embedding is located close to that of Kevin Love, who also recorded a double-double. However, he did not participate in the game on December 26, 2016, and his embedding for this game is closer to those of other players who also did not participate.
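Such a visualization can be reproduced with a plain SVD-based PCA; the embedding matrix below is random and purely illustrative:

```python
import numpy as np

def pca_2d(X):
    """Project row vectors (e.g., entity embeddings) onto their first
    two principal components via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T   # coordinates in the top-2 component space

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 16))   # 10 entities, 16-dim embeddings (toy)
coords = pca_2d(emb)              # shape (10, 2), ready for a scatter plot
```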
6.3 Duplicate ratios of extracted relations
As Puduppully et al. (2019) pointed out, a generated summary may mention the same relation multiple times. Such duplicated relations are not favorable in terms of the brevity of text.
Figure 3 shows the ratios of generated summaries with duplicate mentions of relations in the development data. While the models of Wiseman et al. (2017) and Puduppully et al. (2019) showed duplicate ratios of 36.0% and 15.8%, respectively, our model exhibited 4.2%. This suggests that our model dramatically suppresses the generation of redundant relations. We speculate that the tracking model successfully memorizes which input records have already been selected.
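The duplicate ratio reported above can be computed as follows; relation tuples are assumed to be hashable, and the relation-extraction step itself is out of scope here:

```python
def duplicate_ratio(summaries):
    """Fraction of summaries that mention at least one extracted
    relation more than once."""
    def has_dup(relations):
        return len(relations) != len(set(relations))
    dup = sum(1 for rels in summaries if has_dup(rels))
    return dup / len(summaries) if summaries else 0.0
```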
6.4 Qualitative analysis of output examples
Figure 5 shows examples generated from validation inputs by Puduppully et al. (2019)'s model and our model. Whereas both generations seem fluent, the summary from Puduppully et al. (2019)'s model includes erroneous relations, colored in orange.

Specifically, the description of Derrick Rose's relations, "15 points, four assists, three rebounds and one steal in 33 minutes.", is also used for other entities (e.g., John Henson and Willy Hernangomez). This is because, unlike our model, Puduppully et al. (2019)'s model has no tracking module; the tracking module mitigates redundant references, so our output rarely contains such erroneous relations.
However, when complicated expressions such as parallel structures are used, our model also generates erroneous relations, as illustrated by the underlined sentences describing two players who scored the same number of points. For example, "11-point efforts" is correct for Courtney Lee but not for Derrick Rose. Developing a method that can handle such complicated relations is left for future study.
6.5 Use of writer information
|Model||RG #||RG P%||CS P%||CS R%||CS F1%||CO DLD%||BLEU|
|Puduppully et al. (2019)||33.06||83.17||33.06||43.59||37.60||16.97||13.96|
|+ writer in stage 1||28.43||84.75||45.00||49.73||47.25||22.16||18.18|
|+ writer in stage 2||35.06||80.51||31.10||45.28||36.87||16.38||17.81|
|+ writer in stage 1 & 2||28.00||82.27||44.37||48.71||46.44||22.41||18.90|
We first look at the results of extending Puduppully et al. (2019)'s model with writer information in Table 4. Adding writer information to content planning (stage 1) improved CS (37.60 to 47.25), CO (16.97 to 22.16), and BLEU (13.96 to 18.18). Adding it to surface realization (stage 2) improved BLEU (13.96 to 17.81), while the effects on the other metrics were small. Adding it to both stages scored the highest BLEU, while the other metrics were not very different from those obtained by adding it to stage 1 only. These results suggest that writer information contributes to both content planning and surface realization when properly used, and that improvements in content planning lead to much better surface realization.
Our model also improved on most metrics and showed the best overall performance when incorporating writer information. As discussed in Section 4.5, the writer embedding is supposed to affect both content planning and surface realization, and our experimental results are consistent with this discussion.
7 Conclusion

In this research, we proposed a new data-to-text model that produces a summary while tracking salient information, imitating a human writing process. Our model outperformed existing models on all evaluation measures. We also explored the effect of incorporating writer information into data-to-text models. With writer information, our model generated the highest-quality summaries, achieving a BLEU score of 20.84.
We would like to thank the anonymous reviewers for their helpful suggestions. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), JST PRESTO (Grant Number JPMJPR1655), and AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).
- Aoki et al. (2018) Tatsuya Aoki, Akira Miyazawa, Tatsuya Ishigaki, Keiichi Goshima, Kasumi Aoki, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2018. Generating Market Comments Referring to External Resources. In Proceedings of the 11th International Conference on Natural Language Generation, pages 135–139.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations.
- Barzilay and Lapata (2005) Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338.
- Bosselut et al. (2018) Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2018. Simulating Action Dynamics with Neural Process Networks. In Proceedings of the Sixth International Conference on Learning Representations.
- Chen and Mooney (2008) David L Chen and Raymond J Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning, pages 128–135.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Clark et al. (2018) Elizabeth Clark, Yangfeng Ji, and Noah A Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 16th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2250–2260.
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
- Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1631–1640.
- Gulcehre et al. (2016) Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the Unknown Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 140–149.
- Hoang et al. (2018) Luong Hoang, Sam Wiseman, and Alexander Rush. 2018. Entity Tracking Improves Cloze-style Reading Comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1049–1055.
- Ji et al. (2017) Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A Smith. 2017. Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830–1839.
- Kiddon et al. (2016) Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339.
- Kobayashi et al. (2016) Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2016. Dynamic entity representation with max-pooling improves machine reading. In Proceedings of the 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 850–855.
- Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213.
- Liang et al. (2009) Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99.
- Liu et al. (2018) Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text Generation by Structure-aware Seq2seq Learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
- Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2016. What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment. In Proceedings of the 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730.
- Murakami et al. (2017) Soichiro Murakami, Akihiko Watanabe, Akira Miyazawa, Keiichi Goshima, Toshihiko Yanase, Hiroya Takamura, and Yusuke Miyao. 2017. Learning to generate market comments from stock prices. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1374–1384.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- Pearson (1901) Karl Pearson. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.
- Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-Text Generation with Content Selection and Planning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.
- Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of adam and beyond. In Proceedings of the Sixth International Conference on Learning Representations.
- Sha et al. (2018) Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. 2018. Order-planning neural text generation from structured data. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- Tanaka-Ishii et al. (1998) Kumiko Tanaka-Ishii, Kôiti Hasida, and Itsuki Noda. 1998. Reactive content selection in the generation of real-time soccer commentary. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1282–1288.
- Taniguchi et al. (2019) Yasufumi Taniguchi, Yukun Feng, Hiroya Takamura, and Manabu Okumura. 2019. Generating Live Soccer-Match Commentary from Play Data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017. Context gates for neural machine translation. Transactions of the Association for Computational Linguistics, 5:87–99.
- Weston et al. (2015) Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. In Proceedings of the Third International Conference on Learning Representations.
- Wiseman et al. (2017) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in Data-to-Document Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.
- Yang et al. (2017) Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-Aware Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1850–1859.
Appendix A Algorithm
The generation process of our model is shown in the algorithm below. For conciseness, we omit the conditioning variables from each probability notation. SoD and EoD denote the “start of the document” and “end of the document” tokens, respectively.
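To illustrate the control flow of the generation process, here is a schematic pure-Python sketch, not the authors’ implementation. The `ToyTracker` and `ToyGenerator` classes, and method names such as `detect_transition`, `select_record`, and `next_word`, are hypothetical stand-ins for the tracking and text generation modules; the real modules are learned neural components.

```python
SOD, EOD = "<sod>", "<eod>"  # start/end-of-document tokens

def generate(records, tracker, generator, max_len=50):
    """Generate a document token by token until EoD is emitted."""
    state = tracker.init_state(records)
    words = [SOD]
    while words[-1] != EOD and len(words) <= max_len:
        # Tracking module: on a saliency transition, select a new data
        # record and update the tracking state.
        if tracker.detect_transition(state, words[-1]):
            record = tracker.select_record(records, state)
            state = tracker.update(state, record)
        # Generation module: emit the next token conditioned on the state.
        words.append(generator.next_word(state, words))
    return words

class ToyTracker:
    """Toy stand-in: deterministically moves to the next record each step."""
    def init_state(self, records):
        return {"i": -1, "record": None}
    def detect_transition(self, state, last_word):
        return True  # a learned module would decide this from the state
    def select_record(self, records, state):
        j = state["i"] + 1
        return records[j] if j < len(records) else None
    def update(self, state, record):
        return {"i": state["i"] + 1, "record": record}

class ToyGenerator:
    """Toy stand-in: verbalizes the current record as one token."""
    def next_word(self, state, words):
        rec = state["record"]
        return EOD if rec is None else "{} scored {} points".format(rec[0], rec[1])

doc = generate([("Jordan Clarkson", 18), ("LaMarcus Aldridge", 19)],
               ToyTracker(), ToyGenerator())
```

With the toy modules, the loop verbalizes each record exactly once and then emits EoD, mirroring the select–track–generate cycle described above.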
Appendix B Experimental settings
We set the dimensions of the embeddings to 128 and those of the RNN hidden states to 512. All parameters are initialized with Xavier initialization Glorot and Bengio (2010). We set the maximum number of epochs to 30 and choose the model with the highest BLEU score on the development data. The initial learning rate is 2e-3, and AMSGrad Reddi et al. (2018) is used to automatically adjust the learning rate. Our implementation uses DyNet Neubig et al. (2017).
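As a concrete illustration of the settings above, Xavier (Glorot) uniform initialization draws each weight from U(−a, a) with a = √(6 / (fan_in + fan_out)). A minimal pure-Python sketch with the reported dimensions follows; it is framework-independent (the paper’s implementation uses DyNet) and `xavier_uniform` is our own illustrative helper, not a library call.

```python
import math
import random

# Hyperparameters reported in this appendix.
EMB_DIM, HIDDEN_DIM = 128, 512
MAX_EPOCHS, INIT_LR = 30, 2e-3

def xavier_uniform(fan_in, fan_out, rng=random.Random(0)):
    """Xavier initialization: sample a (fan_out x fan_in) matrix from
    U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)]
            for _ in range(fan_out)]

# e.g., an embedding-to-hidden weight matrix
W = xavier_uniform(EMB_DIM, HIDDEN_DIM)
```

The bound a shrinks as the layer widens, which keeps activation and gradient variances roughly constant across layers, the motivation given by Glorot and Bengio (2010).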