Two Birds, One Stone: A Simple, Unified Model for Text Generation from Structured and Unstructured Data

09/23/2019 ∙ Hamidreza Shahidi, et al. ∙ University of Waterloo

A number of researchers have recently questioned the necessity of increasingly complex neural network (NN) architectures. In particular, several recent papers have shown that simpler, properly tuned models are at least competitive across several NLP tasks. In this work, we show that this is also the case for text generation from structured and unstructured data. We consider neural table-to-text generation and neural question generation (NQG) as representative tasks for text generation from structured and unstructured data, respectively. Table-to-text generation aims to generate a description based on a given table, and NQG is the task of generating a question from a given passage such that the question can be answered by a sub-span of that passage. Experimental results demonstrate that a basic attention-based seq2seq model, trained with the exponential moving average technique, achieves the state of the art in both tasks.


1 Introduction

Recent NLP literature can be characterized by increasingly complex neural network architectures that eke out progressively smaller gains over previous models. Following a previous line of research Melis et al. (2017); Mohammed et al. (2018); Adhikari et al. (2019), we investigate the necessity of such complicated neural architectures. In this work, our focus is on text generation from structured and unstructured data, considering description generation from a table (table-to-text) and question generation from a passage and a target answer (text-to-text).

More specifically, the goal in the neural table-to-text generation task is to generate biographies based on Wikipedia infoboxes (structured data). An infobox is a factual table with a number of fields (e.g., name, nationality, and occupation) describing a person. For this task, we use the WikiBio dataset Lebret et al. (2016) as the benchmark dataset. Figure 1 shows an example of a biographic infobox from this dataset as well as the target output textual description.


Target Output:
Sir Bernard Augustus Keen FRS (5 September 1890 – 5 August 1981) was a British soil scientist and Fellow of University College London.

Figure 1: An example infobox from the WikiBio dataset and the corresponding target output description.
Passage: Hydrogen is commonly used in power stations as a coolant in generators due to a number of favorable properties that are a direct result of its light diatomic molecules.
Answer: as a coolant in generators
Question: How is hydrogen used at power stations?
Table 1: A sample (passage, answer, question) triplet from the SQuAD dataset.

Automatic question generation aims to generate a syntactically correct, semantically meaningful, and relevant question from a natural language text and a target answer within it (unstructured data). This is a crucial yet challenging task in NLP that has received growing attention due to its applications in improving question answering systems Duan et al. (2017); Tang et al. (2017, 2018), providing material for educational purposes Heilman and Smith (2010), and helping conversational systems to start and continue a conversation Mostafazadeh et al. (2016). We adopt the widely used SQuAD dataset Rajpurkar et al. (2016) for this task. Table 1 presents a sample (passage, answer, question) triplet from this dataset.

Prior work has made remarkable progress on both of these tasks. However, the proposed models utilize complex neural architectures to capture necessary information from the input(s). In this paper, we question the need for such sophisticated NN models for text generation from inputs comprising structured and unstructured data.

Specifically, we adopt a bi-directional, attention-based sequence-to-sequence (seq2seq) model Sutskever et al. (2014) equipped with a copy mechanism Gu et al. (2016) for both tasks. We demonstrate that this model, together with the exponential moving average (EMA) technique, achieves the state of the art in both neural table-to-text generation and NQG. Interestingly, our model is able to achieve this result even without using any linguistic features.

Our contributions are two-fold: First, we propose a unified NN model for text generation from structured and unstructured data built on an attention-based seq2seq model equipped with a copy mechanism. We show that training this model with the EMA technique leads to the state of the art in neural table-to-text generation as well as NQG. Second, because our model is in essence the main building block of previous models, our results show that some previous papers propose needless complexity, and that gains from these previous complex neural architectures are quite modest. In other words, the state of the art is achieved by careful tuning of simple and well-engineered models, not necessarily by adding more complexity to the model, echoing the sentiments of Lipton and Steinhardt (2018).

2 Related Work

In this section, we first discuss previous work for neural table-to-text generation, and then NQG.

2.1 Neural Table-to-Text Generation

Recently, there have been a number of end-to-end trainable NN models for table-to-text generation. Lebret et al. (2016) propose an n-gram statistical language model that incorporates field and position embeddings to represent the structure of a table. However, their model is not effective enough to capture long-range contextual dependencies while generating descriptions.

To address this issue, Liu et al. (2018) suggest a structure-aware seq2seq model with local and global addressing on the table. While local addressing is realized by content encoding of the model’s encoder and word-level attention, global addressing is realized by field encoding in a field-gating LSTM with field-level attention. Field-gating is a mechanism introduced to consider field information when updating the cell memory of the LSTM units.

Liu et al. (2019b) propose a two-level hierarchical encoder with coarse-to-fine attention to model the field-value structure of a table. They also propose three joint tasks (sequence labeling, text autoencoding, and multi-label classification) as auxiliary supervision to capture accurate semantic representations of the tables.

In this paper, similar to Lebret et al. (2016), we use both content and field information to represent a table by concatenating field and position embeddings to the word embeddings. Unlike Liu et al. (2018), we do not separate local and global addressing with dedicated modules for each; rather, we adopt the EMA technique and let the bi-directional model accomplish this implicitly, exploiting its natural advantages.

2.2 Neural Question Generation

Previous question generation models can be classified into rule-based and neural-network-based approaches.

Du et al. (2017) propose a seq2seq model for question generation that achieves better results than previous rule-based systems. However, their model does not take the target answer into consideration. In contrast, Zhou et al. (2017) concatenate answer position indicators to the word embeddings to make the model aware of the target answer. They also use lexical features (e.g., POS and NER tags) to enrich their model’s encoder. Song et al. (2018) suggest using a multi-perspective context matching algorithm to further leverage information from explicit interactions between the passage and the target answer.

Recently, Kim et al. (2018) use an answer-separated seq2seq model, which replaces the target answer in the passage with a special token to avoid using answer words in the generated question. They also utilize a keyword-net module to extract key information from the target answer. Similarly, Liu et al. (2019a) propose a clue word predictor based on graph convolutional networks to highlight the important aspects of the input passage.

Our model is architecturally most similar to that of Zhou et al. (2017), but with the following distinctions: (1) we do not use lexical features (i.e., POS tags, NER tags, and word case features); (2) we utilize the EMA technique during training and use the averaged weights for evaluation; (3) we do not make use of their maxout hidden layer; and (4) we adopt LSTM units instead of GRU units. These distinctions, along with some hyperparameter differences, notably the optimizer and learning rate, have a substantial impact on the experimental results (see Section 5).

Figure 2: An overview of our model.

3 Model: Seq2Seq with Attention and a Copy Mechanism

In this section, we introduce a simple but effective attention-based seq2seq model for both neural table-to-text generation and NQG. Figure 2 provides an overview of our model.

3.1 Encoder

Our encoder is a bi-directional LSTM (BiLSTM) whose input at time step $t$ is the concatenation of the current word embedding $w_t$ with some additional task-specific features.

For neural table-to-text generation, the additional features are the field name and position information, following Lebret et al. (2016). The position information itself is the concatenation of $p_t^+$, the position of the current word in its field when counting from the left, and $p_t^-$, its position when counting from the right. Considering the word University in Figure 1 as an example, it is the first word from the left and the third word from the right in the Institutions field. Hence, the structural information of this word would be {Institutions, 1, 3}. Accordingly, the input to the encoder at time step $t$ for this task is $x_t = [w_t; f_t; p_t^+; p_t^-]$, where $f_t$ is the field embedding. We hypothesize that our model is able to utilize the information in $p_t^+$ and $p_t^-$ more effectively as a result of deploying a bi-directional encoder.

For NQG, similar to Zhou et al. (2017), we use a single bit $b_t$, indicating whether the $t$-th word in the passage belongs to the target answer, as an additional feature. Hence, the input at time step $t$ is $x_t = [w_t; b_t]$. Notably, unlike previous work Kim et al. (2018); Song et al. (2018), we do not use a separate encoder for the target answer, so as to have a unified model for both tasks.
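
To make the shared input representation concrete, the sketch below shows one way the per-token encoder inputs could be assembled in PyTorch. It is a minimal illustration using the dimensions stated in Section 4.2; the module and argument names are ours, not taken from a released implementation.

```python
import torch
import torch.nn as nn

class EncoderInputs(nn.Module):
    """Builds per-token encoder inputs for both tasks (names are illustrative)."""

    def __init__(self, vocab_size, field_vocab_size, max_pos=30,
                 word_dim=400, field_dim=50, pos_dim=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Table-to-text features: field name plus positions counted from both ends.
        self.field_emb = nn.Embedding(field_vocab_size, field_dim)
        self.pos_left_emb = nn.Embedding(max_pos + 1, pos_dim)
        self.pos_right_emb = nn.Embedding(max_pos + 1, pos_dim)

    def table_to_text(self, words, fields, pos_left, pos_right):
        # x_t = [w_t; f_t; p_t^+; p_t^-]
        return torch.cat([self.word_emb(words), self.field_emb(fields),
                          self.pos_left_emb(pos_left),
                          self.pos_right_emb(pos_right)], dim=-1)

    def question_generation(self, words, answer_bits):
        # x_t = [w_t; b_t]; b_t = 1 if the t-th word lies in the answer span.
        return torch.cat([self.word_emb(words),
                          answer_bits.float().unsqueeze(-1)], dim=-1)
```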

Each hidden state $h_t$ of the encoder is the concatenation of a forward hidden state $\overrightarrow{h}_t$ and a backward hidden state $\overleftarrow{h}_t$. Eventually, we initialize our decoder according to the following equation:

$s_0 = \tanh\left(W_d\,[\overleftarrow{h}_1;\ \overrightarrow{h}_n]\right)$  (1)

where $W_d$ is a weight matrix and $n$ is the number of input words.
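
A corresponding sketch of the BiLSTM encoder and the decoder initialization of Equation 1 (as reconstructed above) might look as follows; treating `init_proj` as the weight matrix $W_d$ applied to the final forward and backward states is our assumption, not a detail taken from the original implementation.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-directional LSTM encoder; hidden size follows Section 4.2 (sketch only)."""

    def __init__(self, input_dim, hidden_dim=500, num_layers=1):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        self.init_proj = nn.Linear(2 * hidden_dim, hidden_dim)  # plays the role of W_d

    def forward(self, x):
        # h: (batch, n, 2 * hidden_dim), where each h_t = [forward_t; backward_t]
        h, (h_n, _) = self.bilstm(x)
        # Final forward state (position n) and final backward state (position 1).
        final = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        s_0 = torch.tanh(self.init_proj(final))  # decoder initial state (Equation 1)
        return h, s_0
```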

3.2 Attention-Based Decoder

Our decoder is an attentional LSTM model Bahdanau et al. (2014). The following equations summarize our decoder’s operations:

$s_t = \mathrm{LSTM}(y_{t-1}, s_{t-1})$  (2)
$e_{t,i} = v_a^\top \tanh(W_a s_t + U_a h_i + b_a)$  (3)
$\alpha_{t,i} = \exp(e_{t,i}) \,/\, \sum_{j=1}^{n} \exp(e_{t,j})$  (4)
$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$  (5)
$P_{\mathrm{vocab}} = \mathrm{softmax}(W_v [s_t; c_t] + b_v)$  (6)

where $s_t$, $y_t$, $c_t$, and $P_{\mathrm{vocab}}$ are the decoder hidden state, output word, context vector, and vocabulary distribution, respectively, all at time step $t$; $W_a$, $U_a$, $v_a$, $b_a$, $W_v$, and $b_v$ are trainable parameters.

Due to the considerable overlap between input and output words, we use a copy mechanism that integrates the attention distribution over the input words with the vocabulary distribution:

$P(y_t) = g_t\, P_{\mathrm{vocab}}(y_t) + (1 - g_t) \sum_{i:\, x_i = y_t} \alpha_{t,i}$  (7)

where $g_t$ is a gate that determines whether to generate a word from the vocabulary or to copy it directly from the input text, and $\alpha_t$ is the current attention distribution. The following expression defines $g_t$:

$g_t = \sigma\left(w_c^\top c_t + w_s^\top s_t + w_y^\top y_{t-1} + b_g\right)$  (8)

where the vectors $w_c$, $w_s$, $w_y$ and the scalar $b_g$ are model parameters to be learned.
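
The following sketch puts Equations 2–8, as reconstructed above, together as a single decoding step in PyTorch. It assumes that source tokens are mapped into the decoder vocabulary so that copied probability mass can be added at the corresponding indices; all class and variable names are illustrative rather than taken from a released codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyDecoderStep(nn.Module):
    """One step of the attention-based decoder with a copy gate (sketch of Eqs. 2-8)."""

    def __init__(self, emb_dim, hidden_dim, enc_dim, vocab_size):
        super().__init__()
        self.lstm = nn.LSTMCell(emb_dim, hidden_dim)                    # Eq. 2
        self.W_a = nn.Linear(hidden_dim, hidden_dim)                    # attention (Eq. 3)
        self.U_a = nn.Linear(enc_dim, hidden_dim)
        self.v_a = nn.Linear(hidden_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)          # Eq. 6
        self.copy_gate = nn.Linear(hidden_dim + enc_dim + emb_dim, 1)   # Eq. 8

    def forward(self, y_prev, state, enc_h, src_ids):
        # y_prev: embedding of the previous output word, (batch, emb_dim)
        # state:  (s_{t-1}, cell_{t-1}); enc_h: encoder states, (batch, n, enc_dim)
        # src_ids: source token ids in the decoder vocabulary, (batch, n)
        s_t, cell_t = self.lstm(y_prev, state)                                   # Eq. 2
        scores = self.v_a(torch.tanh(self.W_a(s_t).unsqueeze(1)
                                     + self.U_a(enc_h))).squeeze(-1)             # Eq. 3
        alpha = F.softmax(scores, dim=-1)                                        # Eq. 4
        c_t = torch.bmm(alpha.unsqueeze(1), enc_h).squeeze(1)                    # Eq. 5
        p_vocab = F.softmax(self.out(torch.cat([s_t, c_t], -1)), dim=-1)         # Eq. 6
        g_t = torch.sigmoid(self.copy_gate(torch.cat([c_t, s_t, y_prev], -1)))   # Eq. 8
        # Eq. 7: scatter the attention mass onto the ids of the source words.
        p_copy = torch.zeros_like(p_vocab).scatter_add_(1, src_ids, alpha)
        return g_t * p_vocab + (1.0 - g_t) * p_copy, (s_t, cell_t), alpha
```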

3.3 Exponential Moving Average

The exponential moving average (EMA) technique, also referred to as temporal averaging, was introduced in optimization algorithms to improve generalization and to reduce the noise of stochastic approximation in recent parameter estimates by averaging the model parameters Moulines and Bach (2011); Kingma and Ba (2014).

In applying the technique, we maintain two sets of parameters: (1) training parameters which are trained as usual, and (2) evaluation parameters that are exponentially weighted moving averages of training parameters. The moving average is calculated using the following expression:

$\bar{\theta}_t = \gamma\, \bar{\theta}_{t-1} + (1 - \gamma)\, \theta_t$  (9)

where $\theta_t$ denotes the training parameters at step $t$, $\bar{\theta}_t$ the corresponding evaluation parameters, and $\gamma$ the decay rate. Previous work has shown that evaluations that use the averaged parameters often produce more stable and accurate results than those that use the final trained values. As shown in Section 5, this simple training technique substantially improves the performance of our model in both tasks.
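
A minimal sketch of how EMA can be maintained during training in PyTorch is shown below: after each optimizer step the shadow copy is updated, and evaluation loads the shadow weights into the model. The class and method names are ours.

```python
import torch

class EMA:
    """Exponentially weighted moving average of model parameters (Equation 9)."""

    def __init__(self, model, decay):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    @torch.no_grad()
    def update(self, model):
        # theta_bar_t = gamma * theta_bar_{t-1} + (1 - gamma) * theta_t
        for name, p in model.named_parameters():
            if p.requires_grad:
                self.shadow[name].mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # Load the averaged (evaluation) parameters into a model before validation.
        for name, p in model.named_parameters():
            if name in self.shadow:
                p.copy_(self.shadow[name])
```

In a typical training loop, `update` would be called after every optimizer step, and a separate evaluation copy of the model would receive the averaged weights via `copy_to`.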

4 Experimental Setup

In this section, we introduce the datasets and implementation details for both neural table-to-text generation and NQG; experimental results are presented in Section 5.

4.1 Datasets

We use the WikiBio dataset Lebret et al. (2016) for neural table-to-text generation. This dataset contains 728,321 articles from English Wikipedia and uses the first sentence of each article as the ground-truth description of the corresponding infobox. The dataset has been divided into training (80%), validation (10%), and test (10%) sets.

For NQG, the SQuAD dataset Rajpurkar et al. (2016) is used as the benchmark dataset. It contains 23,215 paragraphs with over 100k questions and their answers. Since the test set of the original dataset is not publicly available, Du et al. (2017) and Zhou et al. (2017) re-split the available data into training, validation, and test sets, which we call split-1 and split-2, respectively.

Models | Split-1 (BLEU-4 / METEOR / ROUGE-L) | Split-2 (BLEU-4 / METEOR / ROUGE-L)
Heilman (2011) | - / - / - | 9.47 / 31.68 / 18.97
Du et al. (2017) | 12.28 / 16.62 / 39.75 | - / - / -
Zhou et al. (2017) | - / - / - | 13.29 / - / -
Zhou et al. (2018) | - / - / - | 13.02 / - / 44.0
Yao et al. (2018) | - / - / - | 13.36 / 17.70 / 40.42
Song et al. (2018) | 13.98 / 18.77 / 42.72 | 13.91 / - / -
Zhao et al. (2018) | 15.32 / 19.29 / 43.91 | 15.82 / 19.67 / 44.24
Sun et al. (2018) | - / - / - | 15.64 / - / -
Kumar et al. (2018) | 16.17 / 19.85 / 43.90 | - / - / -
Kim et al. (2018) | 16.20 ± 0.32 / 19.92 ± 0.20 / 43.96 ± 0.25 | 16.17 ± 0.35 / - / -
Liu et al. (2019a) | - / - / - | 17.55 / 21.24 / 44.53
Our Model | 14.81 ± 0.47 / 19.69 ± 0.24 / 43.01 ± 0.28 | 16.14 ± 0.25 / 20.44 ± 0.20 / 43.95 ± 0.26
+ EMA | 16.29 ± 0.04 / 20.70 ± 0.08 / 44.18 ± 0.15 | 17.47 ± 0.10 / 21.37 ± 0.06 / 45.18 ± 0.22
Table 2: Experimental results for NQG on the test set.
Models | BLEU-4 | ROUGE-4
KN† | 2.21 | 0.38
Template KN‡ | 19.80 | 10.70
Lebret et al. (2016) | 34.70 ± 0.36 | 25.80 ± 0.36
Bao et al. (2018) | 40.26 | -
Sha et al. (2018) | 43.91 | 37.15
Liu et al. (2018) Orig. | 44.89 ± 0.33 | 41.21 ± 0.25
Liu et al. (2018) Repl. | 44.45 ± 0.11 | 39.65 ± 0.10
Liu et al. (2019b) | 45.14 ± 0.34 | 41.26 ± 0.37
Our Model | 46.07 ± 0.17 | 41.53 ± 0.30
+ EMA | 46.76 ± 0.03 | 43.54 ± 0.07
Table 3: Experimental results for neural table-to-text generation on the test set. †KN is the Kneser-Ney language model Heafield et al. (2013). ‡Template KN is a KN language model over templates. Both models are proposed by Lebret et al. (2016) as baselines.

4.2 Implementation Details

For the sake of reproducibility, we discuss the implementation details behind the results shown in Tables 2 and 3. We train the model using cross-entropy loss and, for both tasks, retain the model that performs best on the validation set during training. We replace unknown tokens with the input word that has the highest attention score according to Equation 4. In addition, the same EMA decay rate is used for both tasks.
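
As an illustration of the UNK-replacement rule, a small post-processing sketch is given below; `attention` is assumed to hold the per-step distributions of Equation 4, and the function name is ours.

```python
def replace_unks(output_tokens, attention, source_tokens, unk="<unk>"):
    """Replace each generated <unk> with the source word that received the
    highest attention weight at that decoding step (Equation 4)."""
    fixed = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            # Index of the most-attended source position at step t.
            best = max(range(len(source_tokens)), key=lambda i: attention[t][i])
            fixed.append(source_tokens[best])
        else:
            fixed.append(token)
    return fixed
```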

For the neural table-to-text generation task, we train the model for up to 10 epochs with a batch size of 32. We use a single-layer BiLSTM for the encoder and a single-layer LSTM for the decoder, and set the dimension of the LSTM hidden states to 500. Optimization is performed using the Adam optimizer with a learning rate of 0.0005 and gradient clipping when the gradient norm exceeds 5. The dimensions of the word, field, and position embeddings are set to 400, 50, and 5, respectively. The maximum position number is set to 30; any word at a higher position is treated as being at position 30. The most frequent 20k words and 1,480 fields in the training set are selected as the word and field vocabularies, respectively, for both the encoder and the decoder.

For the NQG task, we use a two-layer BiLSTM for the encoder and a single-layer LSTM for the decoder. We set the dimension of the LSTM hidden states to 350 and 512 for split-1 and split-2, respectively. Optimization is performed using the AdaGrad optimizer with a learning rate of 0.3 and gradient clipping when the gradient norm exceeds 5. The word embeddings are initialized with pre-trained 300-dimensional GloVe vectors Pennington et al. (2014), which are frozen during training. We train the model for up to 20 epochs with a batch size of 50 and employ dropout with probabilities of 0.1 and 0.3 for split-1 and split-2, respectively. Moreover, we use the vocabulary set released by Song et al. (2018) for both the encoder and the decoder. During decoding, we perform beam search with a beam size of 20 and a length penalty weight of 1.75.
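
For convenience, the hyperparameters stated above are gathered into a single reference sketch; the dictionary layout is ours and only restates values given in this subsection.

```python
# Hyperparameters as reported in Section 4.2 (restated here; not an official config file).
TABLE_TO_TEXT_CONFIG = {
    "epochs": 10, "batch_size": 32,
    "encoder": "1-layer BiLSTM", "decoder": "1-layer LSTM", "hidden_size": 500,
    "optimizer": "Adam", "learning_rate": 5e-4, "grad_clip_norm": 5.0,
    "word_dim": 400, "field_dim": 50, "pos_dim": 5, "max_position": 30,
    "word_vocab": 20_000, "field_vocab": 1_480,
}

NQG_CONFIG = {
    "epochs": 20, "batch_size": 50,
    "encoder": "2-layer BiLSTM", "decoder": "1-layer LSTM",
    "hidden_size": {"split-1": 350, "split-2": 512},
    "optimizer": "AdaGrad", "learning_rate": 0.3, "grad_clip_norm": 5.0,
    "word_embeddings": "GloVe 300-d (frozen)",
    "dropout": {"split-1": 0.1, "split-2": 0.3},
    "beam_size": 20, "length_penalty": 1.75,
}
```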

5 Results and Discussion

We report the mean and standard deviation of each metric across multiple seeds (three for table-to-text generation and five for NQG) to ensure robustness against potentially spurious conclusions Crane (2018). In Tables 2 and 3, we list previous results for NQG and neural table-to-text generation, respectively. All results are copied from the original papers except Liu et al. (2018) in Table 3, where Repl. refers to the scores from experiments that we conducted using the source code released by the authors and Orig. refers to the scores taken from the paper.

It is noteworthy that a similar version of our model has served as a baseline in previous papers Liu et al. (2018); Kim et al. (2018); Liu et al. (2019a). However, the distinctions discussed in Section 2, and especially the EMA technique, enable our model to achieve the state of the art in all cases except BLEU-4 on data split-2 of NQG, where our score is very competitive; furthermore, Liu et al. (2019a) only report results from a single trial. Our results indicate that a basic seq2seq model is able to effectively learn the underlying distribution of both datasets.

6 Conclusions

In this paper, we question the necessity of complex neural architectures for text generation from structured data (neural table-to-text generation) and unstructured data (NQG). Instead, we propose a simple yet effective seq2seq model trained with the EMA technique. Empirically, our model achieves the state of the art in both tasks. Our results highlight the importance of exploring simple models before introducing complex neural architectures.

References

  • A. Adhikari, A. Ram, R. Tang, and J. Lin (2019) Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4046–4051. Cited by: §1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.
  • J. Bao, D. Tang, N. Duan, Z. Yan, Y. Lv, M. Zhou, and T. Zhao (2018) Table-to-text: describing table region with natural language. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 3.
  • M. Crane (2018) Questionable answers in question answering research: reproducibility and variability of published results. Transactions of the Association for Computational Linguistics 6, pp. 241–252. Cited by: §5.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §2.2, §4.1, Table 2.
  • N. Duan, D. Tang, P. Chen, and M. Zhou (2017) Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874. Cited by: §1.
  • J. Gu, Z. Lu, H. Li, and V. O.K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: §1.
  • K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn (2013) Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696. Cited by: Table 3.
  • M. Heilman and N. A. Smith (2010) Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 609–617. Cited by: §1.
  • M. Heilman (2011) Automatic factual question generation from text. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA. AAI3528179, ISBN 978-1-267-58224-9. Cited by: Table 2.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2018) Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393. Cited by: §2.2, §3.1, Table 2, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • V. Kumar, G. Ramakrishnan, and Y. Li (2018) A framework for automatic question generation from text using deep reinforcement learning. arXiv preprint arXiv:1808.04961. Cited by: Table 2.
  • R. Lebret, D. Grangier, and M. Auli (2016) Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771. Cited by: §1, §2.1, §2.1, §3.1, §4.1, Table 3.
  • Z. C. Lipton and J. Steinhardt (2018) Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341. Cited by: §1.
  • B. Liu, M. Zhao, D. Niu, K. Lai, Y. He, H. Wei, and Y. Xu (2019a) Learning to generate questions by learning what not to generate. arXiv preprint arXiv:1902.10418. Cited by: §2.2, Table 2, §5.
  • T. Liu, F. Luo, Q. Xia, S. Ma, B. Chang, and Z. Sui (2019b) Hierarchical encoder with auxiliary supervision for neural table-to-text generation: learning better representation for tables. In Thirty-Third AAAI Conference on Artificial Intelligence, Cited by: §2.1, Table 3.
  • T. Liu, K. Wang, L. Sha, B. Chang, and Z. Sui (2018) Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.1, §2.1, Table 3, §5, §5.
  • G. Melis, C. Dyer, and P. Blunsom (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589. Cited by: §1.
  • S. Mohammed, P. Shi, and J. Lin (2018) Strong baselines for simple question answering over knowledge graphs with and without neural networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 291–296. Cited by: §1.
  • N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. arXiv preprint arXiv:1603.06059. Cited by: §1.
  • E. Moulines and F. R. Bach (2011) Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459. Cited by: §3.3.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §1, §4.1.
  • L. Sha, L. Mou, T. Liu, P. Poupart, S. Li, B. Chang, and Z. Sui (2018) Order-planning neural text generation from structured data. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 3.
  • L. Song, Z. Wang, W. Hamza, Y. Zhang, and D. Gildea (2018) Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 569–574. Cited by: §2.2, §3.1, §4.2, Table 2.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939. Cited by: Table 2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • D. Tang, N. Duan, T. Qin, Z. Yan, and M. Zhou (2017) Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027. Cited by: §1.
  • D. Tang, N. Duan, Z. Yan, Z. Zhang, Y. Sun, S. Liu, Y. Lv, and M. Zhou (2018) Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1564–1574. Cited by: §1.
  • K. Yao, L. Zhang, T. Luo, L. Tao, and Y. Wu (2018) Teaching machines to ask questions. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4546–4552. Cited by: Table 2.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910. Cited by: Table 2.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: §2.2, §2.2, §3.1, §4.1, Table 2.
  • Q. Zhou, N. Yang, F. Wei, and M. Zhou (2018) Sequential copying networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 2.