Natural question generation (QG) is a challenging yet rewarding problem. It has many useful applications including improving the question answering task Chen et al. (2017, 2019a) by providing more training data Du et al. (2017); Tang et al. (2017); Song et al. (2017); Yuan et al. (2017); Li et al. (2018), generating practice exercises and assessments for educational purposes Heilman and Smith (2010); Heilman (2011); Danon and Last (2017) and helping dialog systems, such as Alexa and Google Assistant, to kick-start and continue a conversation with human users Mostafazadeh et al. (2016). While many other works focus on QG from images Mostafazadeh et al. (2016); Fan et al. (2018); Li et al. (2018) or knowledge bases Serban et al. (2016); Elsahar et al. (2018), in this work, we focus on QG from textual data.
Early methods for QG rely on heuristic rules or hand-crafted templates, which leads to low generalizability and scalability. Recent attempts have focused on neural network-based approaches that do not require manually-designed rules and are end-to-end trainable. Inspired by neural machine translation, these approaches formulate the QG task as a sequence-to-sequence (Seq2Seq) learning problem, applying various types of encoders and decoders, and have shown promising results Du et al. (2017); Zhou et al. (2017); Song et al. (2018a, 2017); Kumar et al. (2018a). However, these methods ignore the hidden structural information associated with a word sequence, such as the syntactic parse tree. Thus, they may fail to utilize the rich text structure that complements the simple word sequence.
It has been observed that training Seq2Seq models with cross-entropy based objectives has limitations such as exposure bias and inconsistency between training and test measurements Ranzato et al. (2015); Wu et al. (2016); Norouzi et al. (2016); Paulus et al. (2017); Gong et al. (2019), and thus does not always produce the best results on discrete evaluation metrics. To tackle these issues, some recent QG approaches Song et al. (2017); Kumar et al. (2018b) aim at directly optimizing evaluation metrics using Reinforcement Learning (RL) Williams (1992). However, existing approaches generally do not consider joint mixed objective functions with both semantic and syntactic constraints for guiding text generation.
Early works on neural QG did not take into account the target answer information when generating a question. Recent works Zhou et al. (2017); Song et al. (2018a, 2017); Kumar et al. (2018a); Kim et al. (2018); Liu et al. (2019) have explored various ways of utilizing the target answer to guide the generation of the question, resulting in more relevant questions with better quality. These methods mainly focus on simply marking the answer location in the passage Zhou et al. (2017); Liu et al. (2019), using complex passage-answer matching strategies Song et al. (2018a, 2017), or carefully separating answers from passages when applying a Seq2Seq model Kim et al. (2018). However, they neglect potential semantic relations between word pairs and thus do not explicitly model the global interactions among sequence words in the embedding space.
To address the aforementioned issues, we propose a reinforcement learning based generator-evaluator architecture for QG. Our generator employs a modified graph-to-sequence (Graph2Seq) Xu et al. (2018a) model that encodes the graph representation of a passage using a graph neural network (GNN) based encoder, and outputs a question sequence using an RNN decoder. Our evaluator is trained by optimizing a mixed objective function combining both a cross-entropy loss and an RL loss based on discrete evaluation metrics, where we first pretrain the model with the regular cross-entropy loss. To apply GNNs to non-structured textual data, we also explore both static and dynamic ways of constructing a graph. In addition, we introduce a simple yet effective Deep Alignment Network (DAN) for incorporating answer information into the passage.
We highlight our contributions as follows:
We propose a novel RL based Graph2Seq model for natural question generation.
We design a simple yet effective Deep Alignment Network for explicitly modeling answer information.
We design a novel bidirectional Gated Graph Neural Network to process directed passage graphs.
We design a two-stage training strategy to train the proposed model with both cross-entropy and RL losses.
We explore different ways of constructing passage graphs and investigate their performance impact on a GNN encoder.
The proposed model achieves new state-of-the-art scores and outperforms all previous methods by a great margin on the SQuAD Rajpurkar et al. (2016) benchmark.
2 An RL-based Generator-Evaluator Architecture
Given a passage X^p and a target answer X^a, the task of natural question generation is to generate the best natural language question Ŷ which maximizes the conditional likelihood, i.e., Ŷ = argmax_Y P(Y | X^p, X^a).
We use 300-dim GloVe Pennington et al. (2014) embeddings and 1024-dim BERT Devlin et al. (2018) embeddings to embed each word in the passage and the answer. Let us denote g^p_i and g^a_j as the GloVe embeddings of passage word x^p_i and answer word x^a_j, respectively. Similarly, their corresponding BERT embeddings are denoted by b^p_i and b^a_j.
2.1 Deep Alignment Network
Answer information is crucial for generating relevant questions from a passage. Unlike previous methods that simply mark the answer location in the passage Zhou et al. (2017), we propose the Deep Alignment Network (DAN) for incorporating answer information into the passage at multiple granularity levels. In order to model the interactions between a passage and a target answer with different levels of granularity, we conduct attention-based soft-alignment at both the word level and the contextualized hidden state level.
Let us denote G^p and G^a as the passage embeddings and the answer embeddings, respectively. The soft-alignment operation consists of three steps: i) compute an attention score for each pair of passage word x^p_i and answer word x^a_j; ii) multiply the attention matrix with the answer embeddings to obtain the aligned answer embeddings for the passage; iii) concatenate the aligned answer embeddings with the passage embeddings to get the final passage embeddings. To simplify notation, we denote the soft-alignment function as Align(X, Y), meaning that an attention matrix is computed between two sets of vectors X and Y, which is later used to get a linear combination of the vector set Y.
In this work, we define the attention score between two vectors x_i and x_j as s_ij = ReLU(W x_i)^T ReLU(W x_j), where W is a d-by-d trainable model parameter with d being the hidden state size.
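As an illustration, the three-step soft-alignment can be sketched in a few lines of numpy. The `align` function, the ReLU-based score, and all variable names here are our own illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(passage_emb, answer_emb, W):
    """Soft-align answer embeddings to each passage word.

    passage_emb: (m, d) passage word embeddings
    answer_emb:  (n, d) answer word embeddings
    W:           (d, d) trainable parameter
    Returns the passage embeddings concatenated with the aligned answer
    embeddings, shape (m, 2d).
    """
    # i) attention score for every (passage word, answer word) pair
    scores = np.maximum(passage_emb @ W, 0) @ np.maximum(answer_emb @ W, 0).T
    attn = softmax(scores, axis=1)             # (m, n), rows sum to 1
    # ii) aligned answer embedding for each passage word
    aligned = attn @ answer_emb                # (m, d)
    # iii) concatenate with the original passage embeddings
    return np.concatenate([passage_emb, aligned], axis=1)
```

In a real model the same routine is applied twice, once on word embeddings and once on contextualized hidden states, with gradients flowing through `W`.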
2.1.1 Word-level Alignment
We first perform soft-alignment at the word level based on the GloVe embeddings of the passage and the answer. Specifically, we compute the aligned answer embeddings by applying the soft-alignment function to the GloVe embeddings of the passage and the answer; the i-th row of the result is the aligned answer embedding for the i-th passage word.
For each passage word, we concatenate its GloVe embedding, its BERT embedding, its linguistic feature vector (i.e., case, NER and POS) and its aligned answer embedding into a single vector. A bidirectional LSTM Hochreiter and Schmidhuber (1997) is then applied to the resulting passage embeddings to obtain contextualized passage embeddings. Similarly, for each answer word, we concatenate its GloVe embedding with its BERT embedding, and another BiLSTM is applied to the resulting answer embeddings to obtain contextualized answer embeddings.
2.1.2 Hidden-level Alignment
We proceed to perform soft-alignment at the hidden state level, applying the same soft-alignment function to the contextualized passage and answer embeddings to compute the aligned answer embeddings.
Finally, we apply another BiLSTM to the concatenation of the contextualized passage embeddings and the aligned answer embeddings to obtain the final passage embedding matrix.
2.2 Bidirectional Graph Encoder
While a Recurrent Neural Network (RNN) is good at modeling local interactions among consecutive passage words, a Graph Neural Network (GNN) can better utilize the rich text structure and capture the global interactions among sequence words, and thus further improve the representations. Therefore, we construct a passage graph in which each passage word is a node, and apply a GNN to the graph.
2.2.1 Passage Graph Construction
While it is straightforward to apply GNNs to graph structured data, it is not clear what is the best way of representing text as a graph. In this work, we explore both static and dynamic ways of constructing a graph for textual data.
Syntax-based static graph construction: We construct a directed, unweighted passage graph based on dependency parsing. For each sentence in a passage, we first obtain its dependency parse tree. We then connect neighboring parse trees by linking the two nodes that sit at a sentence boundary and are adjacent to each other.
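The boundary-linking step can be sketched as follows, assuming dependency edges have already been extracted by a parser; `build_static_graph` and its input format are our own illustrative conventions:

```python
def build_static_graph(sent_edges, sent_lengths):
    """Merge per-sentence dependency edges into one passage graph.

    sent_edges:   one edge list per sentence, with word indices local to
                  that sentence: [(head, dependent), ...]
    sent_lengths: number of words in each sentence
    Returns a directed edge list over passage-level word indices.
    """
    edges, offset = [], 0
    for i, local in enumerate(sent_edges):
        # shift sentence-local indices to passage-level indices
        edges.extend((offset + h, offset + d) for h, d in local)
        offset += sent_lengths[i]
        # link neighbouring parse trees at the sentence boundary:
        # last word of this sentence -> first word of the next sentence
        if i + 1 < len(sent_edges):
            edges.append((offset - 1, offset))
    return edges
```

For example, two 2-word sentences with one dependency edge each yield three passage-level edges, including the boundary link between word 1 and word 2.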
Semantic-aware dynamic graph construction: We dynamically build a weighted graph to model semantic relationships among passage words, and we make this graph construction depend not only on the passage, but also on the answer. Specifically, we first apply a self-attention mechanism to the word-level passage embeddings to compute a dense attention matrix, which serves as a weighted adjacency matrix for the passage graph and is parameterized by a trainable weight matrix.
Considering that a fully connected passage graph is not only computationally expensive but also of little use for graph processing, we proceed to extract a sparse, directed graph from the attention matrix via a KNN-style strategy: for each node, we only keep the k nearest neighbors (including the node itself) along with the associated attention scores (i.e., the remaining attention scores are masked off). Finally, we apply a softmax function to these selected adjacency matrix elements to obtain two normalized adjacency matrices for the incoming and outgoing directions, respectively.
Note that the supervision signal is able to back-propagate through the KNN-style graph sparsification operation since the nearest attention scores are kept and used to compute the weighted adjacency matrices.
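A minimal numpy sketch of this sparsification step follows; the self-attention form, `dynamic_graph`, and all names are our own assumptions, and a real implementation would use a deep learning framework so gradients flow through the kept scores:

```python
import numpy as np

def masked_softmax(x, mask):
    """Row-wise softmax over entries where mask is True; zeros elsewhere."""
    x = np.where(mask, x, -np.inf)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dynamic_graph(emb, W, k):
    """Build sparse incoming/outgoing adjacency matrices from embeddings.

    emb: (m, d) word-level passage embeddings
    W:   (d, d) trainable weight of the self-attention (illustrative form)
    k:   number of nearest neighbours to keep per node
    """
    u = np.maximum(emb @ W, 0)                 # ReLU-projected embeddings
    A = u @ u.T                                # dense attention matrix (m, m)
    # KNN-style sparsification: keep the k largest scores in each row
    idx = np.argsort(-A, axis=1)[:, :k]
    mask = np.zeros_like(A, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    mask |= np.eye(len(emb), dtype=bool)       # always keep the node itself
    # normalize the kept scores for the outgoing and incoming directions
    A_out = masked_softmax(A, mask)
    A_in = masked_softmax(A.T, mask.T)
    return A_out, A_in
```

Each row of the two returned matrices is a proper probability distribution over the kept neighbors, which is what the weighted-average aggregation in the GNN encoder consumes.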
2.2.2 Bidirectional Gated Graph Neural Networks
Unlike the previous Graph2Seq work Xu et al. (2018a) that uses bidirectional GraphSAGE Hamilton et al. (2017), we instead propose a novel Bidirectional Gated Graph Neural Network (BiGGNN) to process the directed passage graph. We extend the Gated Graph Sequence Neural Networks Li et al. (2015) to handle directed graphs. Node embeddings are initialized to the passage embeddings produced by the BiLSTM network.
In BiGGNN, the same set of network parameters is shared at every hop of computation. At each hop, for every node in the graph, we apply an aggregation function which takes as input a set of incoming (or outgoing) neighboring node vectors and outputs a backward (or forward) aggregation vector. For the syntax-based static graph, we use a mean aggregator that averages the neighboring node vectors in each direction. For the semantic-aware dynamic graph, we instead compute a weighted average for aggregation, where the weights come from the normalized adjacency matrices for the incoming and outgoing directions.
While Xu et al. (2018a) learn separate node embeddings for both directions independently, we choose to use a fusion function to fuse the information aggregated in two directions in every hop, which we find works better.
The fusion function is designed as a gated sum of the two information sources: a gating vector, computed with a sigmoid function from the forward and backward aggregation vectors, weights their element-wise combination.
Finally, a Gated Recurrent Unit (GRU) Cho et al. (2014) is used to update the node embeddings by incorporating the aggregation information.
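One hop of this bidirectional aggregate-fuse-update scheme can be sketched as follows. For brevity the sketch replaces the paper's GRU update with a single tanh layer, and `biggnn_hop` and its parameters are illustrative names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def biggnn_hop(h, edges, Wz, Wu):
    """One message-passing hop of a bidirectional gated GNN (sketch).

    h:     (m, d) current node embeddings
    edges: directed edge list [(src, dst), ...]
    Wz:    (2d, d) fusion-gate parameter (illustrative)
    Wu:    (2d, d) update parameter standing in for the paper's GRU
    """
    m, d = h.shape
    fwd = np.zeros((m, d))          # forward aggregation (outgoing neighbours)
    bwd = np.zeros((m, d))          # backward aggregation (incoming neighbours)
    cnt_out = np.zeros(m)
    cnt_in = np.zeros(m)
    for s, t in edges:
        bwd[t] += h[s]; cnt_in[t] += 1    # s is an incoming neighbour of t
        fwd[s] += h[t]; cnt_out[s] += 1   # t is an outgoing neighbour of s
    bwd /= np.maximum(cnt_in, 1)[:, None]     # mean aggregation per direction
    fwd /= np.maximum(cnt_out, 1)[:, None]
    # gated fusion of the two directions: z * fwd + (1 - z) * bwd
    z = sigmoid(np.concatenate([fwd, bwd], axis=1) @ Wz)
    agg = z * fwd + (1.0 - z) * bwd
    # node update; the paper uses a GRU, a single tanh layer keeps this short
    return np.tanh(np.concatenate([h, agg], axis=1) @ Wu)
```

Stacking n such hops with shared parameters, as the paper does, lets information propagate n edges away from each node.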
After n hops of GNN computation, where n is a hyperparameter, we obtain the final state representation of each node. To compute the graph-level representation, we first apply a linear projection to the node embeddings, and then apply max pooling over all node embeddings to obtain a graph embedding vector.
2.3 RNN Decoder

On the decoder side, we follow previous works See et al. (2017); Song et al. (2017) and adopt an attention-based Bahdanau et al. (2014); Luong et al. (2015) LSTM decoder with copy Vinyals et al. (2015); Gu et al. (2016) and coverage mechanisms Tu et al. (2016). The decoder takes the graph-level embedding, followed by two separate fully-connected layers, as the initial hidden and cell states, and the node embeddings as the attention memory, and generates the output sequence one word at a time.
The particular decoder used in this paper closely follows the work of See et al. (2017), and we refer the readers to that work for more details; we briefly describe it here. At each decoding step t, an attention mechanism learns to attend to the most relevant words in the input sequence, and computes a context vector based on the current decoding state, the current coverage vector and the attention memory. In addition, the generation probability p_gen is calculated from the context vector, the decoder state and the decoder input. Next, p_gen is used as a soft switch to choose between generating a word from the vocabulary and copying a word from the input sequence. We dynamically maintain an extended vocabulary which is the union of the usual vocabulary and all words appearing in a batch of source examples (i.e., passages and answers). Finally, in order to encourage the decoder to utilize the diverse components of the input sequence, a coverage mechanism is applied: at each step, we maintain a coverage vector, which is the sum of the attention distributions over all previous decoder time steps, and a coverage loss is computed to penalize repeatedly attending to the same locations of the input sequence.
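The soft switch between generating and copying can be sketched as follows, with a hypothetical `final_distribution` helper and numpy for illustration:

```python
import numpy as np

def final_distribution(p_vocab, attn, src_ids, p_gen, ext_vocab_size):
    """Blend generation and copy distributions with the soft switch p_gen.

    p_vocab: (V,) generation distribution over the fixed vocabulary
    attn:    (m,) attention over source positions (the copy distribution)
    src_ids: (m,) extended-vocabulary id of each source word
    p_gen:   scalar probability of generating rather than copying
    """
    out = np.zeros(ext_vocab_size)
    out[: len(p_vocab)] = p_gen * p_vocab
    # scatter-add copy probabilities onto the extended vocabulary, so
    # out-of-vocabulary source words can still be produced
    np.add.at(out, src_ids, (1.0 - p_gen) * attn)
    return out
```

Because the copy probabilities are scattered onto extended-vocabulary ids, a source word outside the fixed vocabulary can still receive probability mass.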
2.4 Policy Based Reinforcement Learning
The most widely used way to train a sequence learning model is to optimize the log-likelihood of the ground-truth output sequence at each decoding step, which is known as the teacher forcing algorithm Williams and Zipser (1989).
However, it has been observed that optimizing such cross-entropy based training objectives for sequence learning does not always produce the best results on discrete evaluation metrics Ranzato et al. (2015); Wu et al. (2016); Norouzi et al. (2016); Paulus et al. (2017); Gong et al. (2019). There are two main limitations of this method. First, a model has access to the ground-truth sequence up to the next token during training but does not have such supervision when testing, resulting in accumulated errors. This gap of model behavior between training and inference is called exposure bias Ranzato et al. (2015). Second, there is an evaluation discrepancy between training and testing. A model is optimized with cross-entropy loss during training while evaluated with discrete evaluation metrics during testing.
To address the above issues, we present an evaluator in which a mixed objective function combining both cross-entropy loss and RL loss is designed to ensure the generation of semantically and syntactically valid text. We learn a policy that directly optimizes an evaluation metric using REINFORCE Williams (1992). In this work, we utilize the self-critical policy gradient training algorithm Rennie et al. (2017), an efficient REINFORCE algorithm that, rather than estimating the reward signal or how it should be normalized, uses the output of its own test-time inference algorithm to normalize the rewards it experiences.
For this training algorithm, at each training iteration, the model generates two output sequences: the sampled output Y^s, produced by multinomial sampling (i.e., each word is sampled according to the likelihood predicted by the generator), and the baseline output Ŷ, obtained by greedy search (i.e., maximizing the output probability distribution at each decoding step). We define r(Y) as the reward of an output sequence Y, computed by comparing it to the corresponding ground-truth sequence with some reward metrics. The loss function is defined as L_rl = (r(Ŷ) − r(Y^s)) · Σ_t log P(y^s_t | X, y^s_<t).
As we can see, if the sampled output has a higher reward than the baseline one, we maximize its likelihood, and vice versa.
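Under these definitions, the self-critical loss for one example can be sketched as follows (the function and argument names are ours):

```python
import numpy as np

def self_critical_loss(log_probs_sampled, r_sampled, r_baseline):
    """Self-critical policy-gradient loss for one example (sketch).

    log_probs_sampled: per-token log-likelihoods of the sampled sequence
    r_sampled:         reward of the multinomial-sampled output
    r_baseline:        reward of the greedy (test-time inference) output
    """
    # positive advantage (sample beats greedy baseline) -> minimizing the
    # loss raises the sampled sequence's likelihood, and vice versa
    return -(r_sampled - r_baseline) * np.sum(log_probs_sampled)
```

When the sampled sequence matches the baseline reward exactly, the advantage and therefore the gradient vanish, which is what makes the greedy output an effective baseline.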
Evaluation metric as reward function: We use one of our evaluation metrics, BLEU-4, as our reward function, which lets us directly optimize the model towards the evaluation metrics.
Semantic metric as reward function:
One drawback of evaluation metrics like BLEU is that they do not measure meaning; they only reward systems for n-grams that exactly match n-grams in the reference. To make our reward function more effective and robust, we additionally use word mover's distance (WMD) as a semantic reward function. WMD is a state-of-the-art approach for measuring the dissimilarity between two sentences based on word embeddings Kusner et al. (2015). Following Gong et al. (2019), we take the negative of the WMD distance between a generated sequence and the ground-truth sequence, and divide it by the sequence length to obtain its semantic score.
We define the final reward function as r(Y) = r_eval(Y) + α r_sem(Y), where r_eval is the BLEU-4 reward, r_sem is the WMD-based semantic reward, and α is a scalar.
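A sketch of this mixed reward (the helper name and argument conventions are ours; in practice the WMD value would come from a library such as gensim):

```python
def mixed_reward(bleu4, wmd_distance, seq_len, alpha=0.1):
    """Combine the evaluation reward with the semantic reward (sketch).

    bleu4:        BLEU-4 score of the generated sequence
    wmd_distance: word mover's distance to the ground-truth sequence
    seq_len:      length of the generated sequence
    alpha:        scalar weight of the semantic reward
    """
    semantic = -wmd_distance / seq_len   # length-normalized negative WMD
    return bleu4 + alpha * semantic
```

With the paper's setting of alpha = 0.1, the semantic term acts as a small corrective signal on top of the BLEU-4 reward rather than dominating it.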
2.5 Training and Testing
We train our model in two stages. In the first stage, we train the model using the regular cross-entropy loss, defined as L_lm = Σ_t [ −log P(y*_t | X, y*_<t) + λ covloss_t ], where y*_t is the word at the t-th position of the ground-truth output sequence, λ is the coverage loss ratio, and covloss_t is the coverage loss at step t, defined as Σ_i min(a^t_i, c^t_i), with a^t_i being the i-th element of the attention vector over the input sequence at time step t and c^t_i the corresponding element of the coverage vector. Note that the scheduled teacher forcing strategy Bengio et al. (2015) is adopted to alleviate the exposure bias problem.
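The coverage penalty can be sketched as follows (numpy; `coverage_loss` is our illustrative name):

```python
import numpy as np

def coverage_loss(attn_history):
    """Coverage penalty: sum over steps of sum_i min(a_i^t, c_i^t).

    attn_history: (T, m) attention distribution at each decoder step
    """
    coverage = np.zeros(attn_history.shape[1])  # running sum of attention
    total = 0.0
    for a in attn_history:
        total += np.minimum(a, coverage).sum()  # penalize re-attended slots
        coverage += a
    return total
```

The penalty is zero whenever each decoding step attends to positions not attended before, and grows when the decoder keeps returning to the same source words.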
In the second stage, we fine-tune the model by optimizing a mixed objective function combining both cross-entropy loss and reinforcement loss, defined as L = γ L_rl + (1 − γ) L_lm, where γ is a scaling factor controlling the trade-off between the two losses. Similar mixed-objective training strategies have previously been applied for machine translation and text summarization Wu et al. (2016); Paulus et al. (2017).
During the testing phase, we use beam search of width 5 to generate our final predictions.
3 Experiments

In this section, we evaluate our proposed model against state-of-the-art models. Following previous works, we use the SQuAD dataset Rajpurkar et al. (2016) as our benchmark, described below. The implementation of the model will be publicly available soon.
3.1 Data and Metrics
SQuAD contains more than 100K questions posed by crowd workers on 536 Wikipedia articles. Since the test set of the original SQuAD is not publicly available, the accessible parts (∼90%) are used as the entire dataset in our experiments. For fair comparison, we evaluate our model on both data split-1 Song et al. (2018a) (https://www.cs.rochester.edu/~lsong10/downloads/nqg_data.tgz), which contains 75,500/17,934/11,805 examples, and data split-2 Zhou et al. (2017) (https://res.qyzhou.me/redistribute.zip), which contains 86,635/8,965/8,964 examples.
Following previous works, we report and use BLEU-4 Papineni et al. (2002), METEOR Banerjee and Lavie (2005) and ROUGE-L Lin (2004) as our evaluation metrics (BLEU-4 and METEOR were initially designed for evaluating machine translation systems and ROUGE-L was initially designed for evaluating text summarization systems).
3.2 Model Settings
We keep and fix the GloVe vectors for the most frequent 70,000 words in the training set. For BERT embeddings, we compute them on the fly for each word in the text using a (trainable) weighted sum of all BERT layer outputs. The embedding sizes of case, POS and NER tags are set to 3, 12 and 8, respectively. The size of all hidden layers is set to 300. We apply a variational dropout Kingma et al. (2015) rate of 0.4 after word embedding layers and 0.3 after RNN layers. We set the neighborhood size k to 10 for dynamic graph construction. The number of GNN hops is set to 3. During training, in each epoch, we set the initial teacher forcing probability to 0.75 and exponentially increase it as a function of the training step. We set α in the reward function to 0.1, γ in the mixed loss function to 0.99, and the coverage loss ratio λ to 0.4. We use Adam Kingma and Ba (2014) as the optimizer, with the learning rate set to 0.001 in the pretraining stage and 0.00001 in the fine-tuning stage. We reduce the learning rate by a factor of 0.5 if the validation BLEU-4 score stops improving for three epochs, and stop training when no improvement is seen for 10 epochs. We clip gradients at norm 10. The batch size is set to 60 and 50 on data split-1 and split-2, respectively. All hyperparameters are tuned on the development set.
3.3 Baseline methods
3.4 Evaluation Results and Case Study
| Methods | BLEU-4 (split-1) | METEOR (split-1) | ROUGE-L (split-1) | BLEU-4 (split-2) |
| --- | --- | --- | --- | --- |
| NQG++ Zhou et al. (2017) | – | – | – | 13.29 |
| MPQG+R Song et al. (2017) | 13.98 | 18.77 | 42.72 | 13.91 |
| ASs2s Kim et al. (2018) | 16.20 | 19.92 | 43.96 | 16.17 |
| CGC-QG Liu et al. (2019) | – | – | – | 17.55 |
| G2S +BERT+RL (our model) | 17.94 | 21.76 | 46.02 | 18.30 |
Passage: for the successful execution of a project, effective planning is essential.
Gold: what is essential for the successful execution of a project?
Seq2Seq: what type of planning is essential for the successful execution of a project?
G2S w/o DAN: what type of planning is essential for the successful execution of a project?
G2S: what is essential for the successful execution of a project?
G2S +BERT: what is essential for the successful execution of a project?
G2S +BERT+RL: what is essential for the successful execution of a project?
Passage: the church operates three hundred sixty schools and institutions overseas.
Gold: how many schools and institutions does the church operate overseas?
Seq2Seq: how many schools does the church have?
G2S w/o DAN: how many schools does the church have?
G2S: how many schools and institutions does the church have?
G2S +BERT: how many schools and institutions does the church have?
G2S +BERT+RL: how many schools and institutions does the church operate?
Table 1 shows the experimental results comparing against all baseline methods. First of all, as we can see, our full model G2S +BERT+RL achieves the new state-of-the-art scores on both data splits and outperforms all previous methods by a great margin. Notably, some previous state-of-the-art methods relied on many heuristic rules and ad-hoc strategies. For instance, by observing that the generated words are mostly from frequent words, while most low-frequency words are copied from the input, rather than generated, CGC-QG Liu et al. (2019)
annotated clue words in the passage based on word frequency and overlap, masked out low-frequency passage word embeddings, and reduced the target output vocabulary to boost the model performance. ASs2s Kim et al. (2018) replaced the target answer in the original passage with a special token. However, our proposed model does not rely on any of these hand-crafted rules or ad-hoc strategies.
In Table 2, we further show a few examples that illustrate the quality of the text generated from a passage by different models. As we can see, incorporating answer information helps the model identify the answer type of the question to be generated, thus making the generated questions more relevant and specific. We also find that our Graph2Seq model generates more complete and valid questions than the Seq2Seq baseline; we attribute this to the Graph2Seq model's ability to exploit the rich text structure better than a Seq2Seq model. Lastly, the examples show that fine-tuning the model using REINFORCE can improve the quality of the generated questions.
3.5 Ablation Study of the Proposed Model
| Methods | BLEU-4 |
| --- | --- |
| G2S w/o feat. | 16.51 |
| G2S w/o feat. | 16.65 |
| G2S w/o DAN | 12.58 |
| G2S w/o DAN | 12.62 |
| G2S w/ DAN-word only | 15.92 |
| G2S w/ DAN-hidden only | 16.07 |
| G2S w/ GGNN-forward | 16.53 |
| G2S w/ GGNN-backward | 16.75 |
| G2S w/o BIGGNN (Seq2Seq) | 16.14 |
| G2S w/o BIGGNN, w/ GCN | 14.47 |
As shown in Table 3, we also perform an ablation study on the impact of different components (e.g., DAN, BiGGNN and RL fine-tuning) on the SQuAD split-2 test set. Turning off the Deep Alignment Network (DAN) dramatically drops the BLEU-4 score of G2S, which indicates the importance of answer information for QG and shows the effectiveness of DAN; we have a similar observation for the other G2S variant. Further experiments demonstrate that both word-level (G2S w/ DAN-word only) and hidden-level (G2S w/ DAN-hidden only) answer alignments in DAN are helpful. We can see the advantages of Graph2Seq learning over Seq2Seq learning on this task by comparing the performance of G2S and the Seq2Seq baseline. In our experiments, we also observe that doing both forward and backward message passing in the GNN encoder is beneficial. Surprisingly, using GCN Kipf and Welling (2016) as the graph encoder (and converting the input graph to an undirected graph) harms the performance. In addition, fine-tuning the model using REINFORCE further improves the model performance in all settings (i.e., with and without BERT), which shows the benefits of directly optimizing the evaluation metrics. Besides, we find that the pretrained BERT embeddings have a considerable impact on the performance, and fine-tuning them further improves the performance, which demonstrates the power of large-scale pretrained language models. Incorporating common linguistic features (i.e., case, POS, NER) also helps the overall performance to some extent. Lastly, we find that syntax-based static graph construction performs slightly better than semantic-aware dynamic graph construction, even though the latter seems more powerful (i.e., answer-aware and history-aware) and can be optimized towards the QG task in an end-to-end manner. A big advantage of dynamic graph construction, however, is that it does not rely on domain knowledge to construct the graph.
We leave how to better construct a dynamic graph for textual data in an end-to-end manner as future work. Another interesting direction is to explore effective ways of combining both the static and dynamic graphs.
4 Related Work
4.1 Natural Question Generation
Early works Mostow and Chen (2009); Heilman and Smith (2010); Heilman (2011); Hussein et al. (2014) for QG focused on rule-based approaches that rely on heuristic rules or hand-crafted templates, with low generalizability and scalability. Recent attempts have focused on Neural Network (NN) based approaches that do not require manually-designed rules and are end-to-end trainable. Existing NN based approaches Du et al. (2017); Zhou et al. (2017); Song et al. (2018a); Kumar et al. (2018a) rely on the Seq2Seq model with attention, copy or coverage mechanisms. In addition, various ways Zhou et al. (2017); Song et al. (2018a, 2017); Kim et al. (2018) have been proposed to utilize the target answer so as to guide the generation of the question. To address the limitations of cross-entropy based sequence learning, some approaches Song et al. (2017); Kumar et al. (2018b) aim at directly optimizing evaluation metrics using REINFORCE.
4.2 Graph Neural Networks
Over the past few years, graph neural networks (GNNs) Kipf and Welling (2016); Gilmer et al. (2017); Hamilton et al. (2017); Li et al. (2015) have drawn increasing attention. Recently, GNNs have been applied to extend the widely used Seq2Seq architectures Sutskever et al. (2014); Cho et al. (2014) to Graph2Seq architectures Xu et al. (2018a, b, c); Song et al. (2018b). Very recently, researchers have explored methods to automatically construct a graph of visual objects Norcliffe-Brown et al. (2018) or words Liu et al. (2018); Chen et al. (2019b) when applying GNNs to non-graph structured data.
5 Conclusion

We proposed a novel reinforcement learning based Graph2Seq model for natural question generation, in which answer information is incorporated by a simple yet effective Deep Alignment Network, and a novel bidirectional GNN is proposed to process the directed passage graph. Our two-stage training strategy combines the benefits of both cross-entropy based and REINFORCE based training for sequence learning. On the SQuAD dataset, our proposed model achieves new state-of-the-art scores and outperforms all previous methods by a great margin. We also explore different ways of constructing graphs from textual data for graph neural networks. In the future, we would like to investigate more effective ways of automatically learning graph structures from free text.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.3.
- METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §3.1.
- Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §2.5.
- Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §1.
- Bidirectional attentive memory networks for question answering over knowledge bases. arXiv preprint arXiv:1903.02188. Cited by: §1.
- GraphFlow: exploiting conversation flow with graph neural networks for conversational machine comprehension. arXiv preprint arXiv:1908.00059. Cited by: §4.2.
- Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.2.2, §4.2.
- A syntactic approach to domain-specific automatic question generation. arXiv preprint arXiv:1712.09827. Cited by: §1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
- Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §1, §1, §4.1.
- Zero-shot question generation from knowledge graphs for unseen predicates and entity types. arXiv preprint arXiv:1802.06842. Cited by: §1.
- A reinforcement learning framework for natural question generation using bi-discriminators. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1763–1774. Cited by: §1.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §4.2.
- Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671. Cited by: §1, §2.4, §2.4.
- Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: §2.3.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §2.2.2, §4.2.
- Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 609–617. Cited by: §1, §1, §4.1.
- Automatic factual question generation from text. Cited by: §1, §1, §4.1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.1.
- Automatic english question generation system based on template driven scheme. International Journal of Computer Science Issues (IJCSI) 11 (6), pp. 45. Cited by: §1, §4.1.
- Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393. Cited by: §1, §3.3, §3.4, Table 1, §4.1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
- Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §3.2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.5, §4.2.
- Automating reading comprehension by generating question and answer pairs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 335–348. Cited by: §1, §1, §4.1.
- A framework for automatic question generation from text using deep reinforcement learning. arXiv preprint arXiv:1808.04961. Cited by: §1, §4.1.
- From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966. Cited by: §2.4.
- Visual question generation as dual task of visual question answering. pp. 6116–6124. Cited by: §1.
- Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.2.2, §4.2.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out. Cited by: §3.1.
- Learning to generate questions by learning what not to generate. arXiv preprint arXiv:1902.10418. Cited by: §1, §3.3, §3.4, Table 1.
- Contextualized non-local neural networks for sequence learning. arXiv preprint arXiv:1811.08600. Cited by: §4.2.
- Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.3.
- Generating natural questions about an image. arXiv preprint arXiv:1603.06059. Cited by: §1.
- Generating instruction automatically for the reading strategy of self-questioning. Cited by: §1, §4.1.
- Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8344–8353. Cited by: §4.2.
- Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731. Cited by: §1, §2.4.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.1.
- A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §1, §2.4, §2.5.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: 6th item, §3.
- Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §1, §2.4.
- Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §2.4.
- Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §2.3, §2.3.
- Generating factoid questions with recurrent neural networks: the 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807. Cited by: §1.
- Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 569–574. Cited by: §1, §1, §3.1, §4.1.
- A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058. Cited by: §1, §1, §1, §1, §2.3, §3.3, Table 1, §4.1.
- A graph-to-sequence model for AMR-to-text generation. arXiv preprint arXiv:1805.02473. Cited by: §4.2.
- Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §4.2.
- Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027. Cited by: §1.
- Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811. Cited by: §2.3.
- Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700. Cited by: §2.3.
- A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.4.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §1, §2.4.
- Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1, §2.4, §2.5.
- Graph2Seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823. Cited by: §1, §2.2.2, §2.2.2, §4.2.
- Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624. Cited by: §4.2.
- SQL-to-text generation with graph-to-sequence model. arXiv preprint arXiv:1809.05255. Cited by: §4.2.
- Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012. Cited by: §1.
- Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671. Cited by: §1, §1, §2.1, §3.1, §3.3, Table 1, §4.1.