Existing question answering datasets have enabled significant progress in models that provide extractive or unambiguous short answers. However, less attention has been paid to open-ended questions that require explanations. In this work, we present ELI5: a Long Form Question Answering dataset that emphasizes the dual challenges of isolating relevant information within long source documents and generating paragraph-length explanations in response to complex, diverse questions (see illustrations in Figures 1 and 2).
The first challenge of ELI5 is the length and diversity of answers that span multiple sentences: questions are complex and cannot be easily addressed by a short response Nguyen et al. (2016) or by extracting a word or phrase from an evidence document Rajpurkar et al. (2016). Answers also represent one of several valid ways of addressing the query. Many state-of-the-art question answering models perform well compared to human performance for extractive answer selection Radford et al. (2018); Devlin et al. (2018). However, their success does not directly carry over to our setting.
The second challenge is the length and diversity of the content from knowledge sources required to answer our questions. We leverage evidence queried from the web for each question. In contrast to previous datasets where the human written answer could be found with lexical overlap methods Weissenborn et al. (2017), ELI5 poses a significant challenge in siphoning out important information, as no single sentence or phrase contains the full answer. While there are some datasets that do require multi-sentence supporting knowledge such as TriviaQA Joshi et al. (2017), their answers are still short.
We benchmark the performance of several extractive, retrieval, and generative models. Evaluation of our task, and of multi-sentence text generation in general, is challenging. We draw upon several evaluation metrics that quantify performance on intermediary fill-in tasks that lead up to the full answer generation. The overall answer generation quality is measured with ROUGE Lin (2004) and various human evaluation studies.
We develop a strong abstractive baseline by training a Seq2Seq model on multiple tasks over the same data: language modeling, masked word prediction Devlin et al. (2018) and answer generation. We show this approach outperforms conventional Seq2Seq and language modeling, as well as a strong extractive baseline based on BidAF Seo et al. (2017) but generalized to multi-sentence output. However, our best-performing model is still far from the quality of human written answers, with raters preferring the gold answers 86% of the time. Further, we show that model performance is strongly limited by the ability to comprehend long multi-document input and generate long outputs to form a comprehensive answer, leaving this challenge for future research.
2 Related Work
|Dataset||Avg. # Words: Question||Avg. # Words: Document(s)||Avg. # Words: Answer||1st Question Word (%): Why||How||What||When||Where||Who||Which||Other||# Q-A Pairs|
|MS MARCO v2 (Nguyen et al., 2016)||6.4||56||13.8||1.7||16.8||35.0||2.7||3.5||3.3||1.8||35.3||183K|
|TriviaQA (Joshi et al., 2017)||14||2895||2.0||0.2||3.9||32.6||2.0||2.1||16.8||41.8||0.6||110K|
|NarrativeQA (Kocisky et al., 2018)||9.8||656||4.7||9.8||10.7||38.0||1.7||7.5||23.4||2.2||6.8||47K|
|CoQA (Reddy et al., 2018)||5.5||271||2.7||2||5||27||2||5||15||1||43||127K|
|SQuAD (2.0) (Rajpurkar et al., 2018)||9.9||116.6||3.2||1.4||8.9||45.3||6.0||3.6||9.6||4.4||17.6||150K|
|HotpotQA (Yang et al., 2018)||17.8||917||2.2||0.1||2.6||37.2||2.8||2.2||13.8||28.5||12.8||113K|
Various QA datasets have been proposed in roughly two categories: extractive answers and short abstractive answers (see Table 1).
Extractive question answering datasets such as TREC Voorhees (2003), SQuAD Rajpurkar et al. (2016, 2018), NewsQA Trischler et al. (2017), SearchQA Dunn et al. (2017), and QuAC Choi et al. (2018) constrain the answer to a word or short phrase from the input and evaluate using exact match or F1 with the ground truth span. HotpotQA Yang et al. (2018) extends this approach by building questions which challenge models to conduct multi-hop reasoning across multiple paragraphs, but the answer is still a short span. Further, the answer must be straightforward, as it needs to be copied from the supporting evidence — precluding most “how” or “why” type questions.
Abstractive datasets include NarrativeQA Kocisky et al. (2018), a dataset of movie and book summaries and CoQA Reddy et al. (2018), a multi-domain dialogue dataset. Both collect responses with crowdworkers and find that written answers are mostly extractive and short. MS MARCO Nguyen et al. (2016), a dataset of crowdsourced responses to Bing queries, has written answers around 1 sentence long with short input passages. TriviaQA Joshi et al. (2017) contains longer multi-document web input, collected using Bing and Wikipedia. As the dataset is built from trivia, most questions can be answered with a short extractive span.
The task of writing a paragraph length response from multiple supporting documents can be seen as a form of query-based multi-document summarization Tombros and Sanderson (1998). Summarization tasks such as DUC 2004 (https://duc.nist.gov/duc2004/) involve long input and multi-sentence generation, but contain much less training data compared to ELI5. WikiSum Liu et al. (2018) proposes writing Wikipedia articles as a multi-document summarization task. ELI5 requires more directed text generation to answer a question, rather than to write about a general topic. In addition, ELI5 contains a diverse set of questions which can involve more than one Wikipedia concept.
3 Making a Long Form QA Dataset
3.1 Creating the Dataset from ELI5
Several websites provide forums for asking open-ended questions, such as Yahoo Answers, Quora, and numerous Reddit forums, or subreddits. We focus on the subreddit Explain Like I’m Five (ELI5), where users are encouraged to provide answers which are comprehensible by a five year old (https://www.reddit.com/r/explainlikeimfive). ELI5 is appealing because answers are supposed to be entirely self contained, and thus rely less on pre-existing knowledge of the world and use simpler language that is easier to model.
Questions and answers.
We select a set of questions and answers from the ELI5 forum up to July 2018 and then filter it based on how users rated these pairs. First, we only retain questions which have a score of at least two, that is two more ‘up-votes’ than ‘down-votes’. Second, there must be at least one answer with a score of at least two. This yields a final number of 272K questions, and ensures that at least one person other than the author has read the thread and deemed it appropriate. For each thread, we select the answer with the highest voting score as the reference. Note that 63% have one or more other valid answers by our up-vote criteria, potentially doubling the size of the available training data.
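The vote-based filter described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code: the `Thread` and `Answer` containers and the `select_reference` helper are hypothetical names, and only the threshold of two and the "highest-scoring answer" rule come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIN_SCORE = 2  # threshold from the text: up-votes minus down-votes


@dataclass
class Answer:
    text: str
    score: int


@dataclass
class Thread:
    question: str
    score: int
    answers: List[Answer] = field(default_factory=list)


def select_reference(thread: Thread) -> Optional[Answer]:
    """Return the top-voted answer if the thread passes both filters, else None."""
    if thread.score < MIN_SCORE:
        return None
    valid = [a for a in thread.answers if a.score >= MIN_SCORE]
    if not valid:
        return None
    return max(valid, key=lambda a: a.score)


# Toy threads (invented for illustration only).
threads = [
    Thread("Why is the sky blue?", 5, [Answer("Scattering.", 7), Answer("Magic.", 1)]),
    Thread("Low-score question", 1, [Answer("Good answer", 10)]),
    Thread("No good answers", 4, [Answer("Meh", 0)]),
]
kept = [(t.question, select_reference(t).text) for t in threads if select_reference(t)]
print(kept)
```

Only the first toy thread survives: the second fails the question-score filter and the third has no answer with a score of at least two.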
Preparing supporting information.
Next, we collect web sources for every question to provide relevant information that a system can draw upon when generating an answer. Wikipedia has been found effective for factoid-oriented questions (Joshi et al., 2017; Chen et al., 2017). However, early experiments in our setting showed it to be insufficient to cover the wide range of topics present in ELI5 and to address the open-ended nature of the questions. Instead, we use web data provided by Common Crawl (http://commoncrawl.org). Specifically, we consider each of the individual pages in the July 2018 archive (roughly one per URL) as a single document. The data is tokenized with Spacy (https://spacy.io) and we select English documents with FastText language identification (Bojanowski et al., 2017). Finally, we index the data with Apache Lucene (http://lucene.apache.org).
Creating support documents.
We query the index for the 272K questions and gather the 100 most relevant web sources for each question, excluding Reddit. Each web source is the extracted text of one page in Common Crawl. This leads to supporting text for each question of a few hundred thousand words. There is a good chance that the supporting text contains the necessary information to answer the question, but the sheer amount of data is far beyond the scope of what many modern models can handle. We therefore filter the 100 web sources by selecting specific passages using a simple heuristic: we split each web source into sentences, find sentences with the highest TFIDF similarity with respect to the question, add some local context for each of these, and concatenate the result into a single support document, with special tokens indicating non-contiguous passages and document shifts. Each support document is the result of this processing to concatenate relevant information from the web sources.
We find that extracting 15 passages with a context of one sentence before and after the initial selection provides the best trade-off between support document length and likelihood of containing relevant information, where relevance is measured as the likelihood of containing a sentence which has high ROUGE with the answer. We release all 100 Common Crawl IDs for each question and a script to create the support document so future research can use the support document or choose to further investigate the information retrieval problem.
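The passage-selection heuristic can be illustrated with a short sketch. It assumes a single pre-split list of sentences, a smoothed TF-IDF weighting, cosine similarity, and a hypothetical `<P>` separator token; the released script may differ in all of these, and the sketch does not model document shifts between web sources.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map each token list to a sparse TF-IDF dict (smoothed IDF, an assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def build_support_document(question, sentences, n_passages=15, context=1, sep="<P>"):
    """Keep the top-scoring sentences plus `context` neighbours on each side."""
    vecs = tfidf_vectors([tokenize(s) for s in sentences] + [tokenize(question)])
    q_vec = vecs[-1]
    scores = [cosine(v, q_vec) for v in vecs[:-1]]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n_passages]
    keep = set()
    for i in top:
        keep.update(range(max(0, i - context), min(len(sentences), i + context + 1)))
    out, prev = [], None
    for i in sorted(keep):
        if prev is not None and i != prev + 1:
            out.append(sep)  # mark a jump between non-contiguous passages
        out.append(sentences[i])
        prev = i
    return " ".join(out)


sents = [
    "Cats sleep a lot.",
    "Light from the sun enters the air.",
    "The sky is blue because air scatters blue light.",
    "Red light scatters less.",
    "Pizza tastes good.",
    "Dogs bark loudly.",
]
print(build_support_document("Why is the sky blue?", sents, n_passages=1, context=1))
```

The defaults mirror the 15 passages and one sentence of context reported above; the toy call uses a single passage so the effect of the context window is easy to see.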
Finalizing the data set.
If the training data contains questions that are too similar to the validation and test data, a model may perform well on these examples by memorizing related examples. We prevent this by building the validation and test set to contain questions that are sufficiently different from the training data. We compute the TFIDF similarity between each pair of questions in the entire dataset and sample the validation and test set from the subset which has no close neighbor by TFIDF score. The final dataset contains 237K train examples, 10K for valid, and 25K for test.
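One way to picture the held-out selection is the sketch below. It is an assumption-laden illustration: the similarity threshold of 0.5, the smoothed TF-IDF weighting, and the `heldout_candidates` helper are all hypothetical, and the quadratic pairwise loop would need an approximate nearest-neighbor index at the scale of the full dataset.

```python
import math
import re
from collections import Counter


def tfidf(questions):
    """L2-normalised sparse TF-IDF vectors, one per question."""
    docs = [re.findall(r"[a-z0-9]+", q.lower()) for q in questions]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        v = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in Counter(d).items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs


def heldout_candidates(questions, threshold=0.5):
    """Indices of questions whose nearest neighbour has cosine similarity below threshold."""
    vecs = tfidf(questions)
    keep = []
    for i, u in enumerate(vecs):
        nearest = max(
            (sum(w * v.get(t, 0.0) for t, w in u.items())
             for j, v in enumerate(vecs) if j != i),
            default=0.0,
        )
        if nearest < threshold:
            keep.append(i)
    return keep


questions = [
    "Why is the sky blue?",
    "Why is the sky blue during the day?",
    "How do vaccines work?",
    "What causes inflation?",
]
print(heldout_candidates(questions))
```

The validation and test sets would then be sampled from the returned indices, so that no held-out question has a near-duplicate in the training data.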
3.2 Dataset Analysis
|% Correct Human Answers||94.5|
|% Correct Human Answers with Explanation||90.2|
|% Support Document contains Full Answer||65.0|
|% Support Document contains Relevant Info||92.0|
Table 1 compares ELI5 to related datasets in terms of the length of the question, support document, answer, as well as statistics on the question types.
First, ELI5 questions are much longer than in other datasets. This is because the initial question is often followed by a clarifying paragraph detailing what aspect of the general theme should be addressed or the question’s starting assumptions, which need to be considered to answer well. To get a rough idea of the different questions, we categorize them based on interrogative words. ELI5 focuses on open-ended queries which are less represented in other extractive or abstractive datasets. Figure 2 shows examples of ELI5 questions split by type and Appendix Figure 11 displays random examples from the ELI5 training set. Interestingly, even What questions tend to require paragraph-length explanations (What is the difference…).
Support documents contain - sentences or on average words, which puts ELI5 on the higher end of published datasets for document length. ELI5 contains long-form answers with an average length of sentences, or words.
Next, we analyze a random subset of ELI5 to assess the feasibility of answering the questions in the dataset. We judge if the question is answerable by reading each question, the gold answer, and the support document we have created with TF-IDF extraction. Note that questions can have multiple parts and all parts of the question must be answered. We randomly sample 500 question-answer pairs from the training set and find that 94.5% of gold answers fully address the question (Table 2) based on the information in the support document. Figure 12 in Appendix F displays examples of human answers that do not correctly answer the question. A small proportion of answers are correct but do not explain the answer. On the support document side, 65% of the support documents we construct provide the complete answer to the question, and 92% of support documents provide information relevant to the question.
4 Evaluation Methods
Evaluating long-form answers.
There are several aspects to quality: answers should be topical and accurate, fluent, and coherent from start to end. We judge the accuracy aspect by comparing to the gold answer. ROUGE Lin (2004) measures similarity between a model output and one or several references, and is often used in summarization. While our task presents different challenges, such as the diversity of possible answers to a question, we still find the corpus-level metric to be useful to rank different related models (§6). We report F1 for ROUGE-1, ROUGE-2, and ROUGE-L.
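For readers unfamiliar with the metric, the following is a simplified sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 for a single reference. It omits stemming, sentence splitting, and the other options of the official ROUGE package, so its scores will not match the official tool exactly.

```python
from collections import Counter


def _f1(overlap, n_hyp, n_ref):
    if overlap == 0 or n_hyp == 0 or n_ref == 0:
        return 0.0
    p, r = overlap / n_hyp, overlap / n_ref
    return 2 * p * r / (p + r)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hyp, ref, n):
    """F1 over clipped n-gram matches between hypothesis and reference."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum((h & r).values())
    return _f1(overlap, sum(h.values()), sum(r.values()))


def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming, two rows)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(hyp, ref):
    return _f1(lcs_len(hyp, ref), len(hyp), len(ref))


hyp = "the sky is blue".split()
ref = "the sky looks blue today".split()
print(rouge_n(hyp, ref, 1), rouge_n(hyp, ref, 2), rouge_l(hyp, ref))
```

In the toy pair, three of four hypothesis unigrams match three of five reference unigrams, so ROUGE-1 F1 is 2/3; only one bigram matches, giving ROUGE-2 F1 of 2/7.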
Abstractive model metrics.
For generative models, perplexity (PPL) measures the ability to predict the next word in a sequence given its context. For a variant which focuses on semantically important words, we report FILL-1, the accuracy at which models generate different Nouns, Verbs, and Adjectives given the correct preceding tokens in the first 2K examples of the test set. Finally, ROUGE-20% measures the model’s ability to complete an answer given the first 80% of the reference answer, the question, and the support document. Specifically, we generate a number of tokens corresponding to 20% of the average answer length in the validation set, and measure ROUGE between these and the last 20% of the reference. We mentioned that there are several valid ways to answer most questions. This measure abstracts away this variability and evaluates a system’s ability to complete an answer.
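The ROUGE-20% protocol can be sketched as below. The `model_complete` callable is a hypothetical stand-in for a trained model, and the unigram F1 used for scoring is a simplified substitute for the official ROUGE package; only the 80%/20% split and the length budget derived from the validation set come from the text.

```python
from collections import Counter


def split_reference(ref_tokens, frac=0.2):
    """Return (prefix shown to the model, held-out tail to score against)."""
    cut = len(ref_tokens) - max(1, round(frac * len(ref_tokens)))
    return ref_tokens[:cut], ref_tokens[cut:]


def generation_budget(valid_answers, frac=0.2):
    """Number of tokens to generate: frac of the average validation answer length."""
    avg = sum(len(a) for a in valid_answers) / len(valid_answers)
    return max(1, round(frac * avg))


def rouge1_f1(hyp, ref):
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)


def rouge_20(model_complete, question, support, ref_tokens, budget):
    prefix, tail = split_reference(ref_tokens)
    # model_complete is hypothetical: (question, support, prefix, n) -> token list
    continuation = model_complete(question, support, prefix, budget)[:budget]
    return rouge1_f1(continuation, tail)


ref = "a b c d e f g h i j".split()
budget = generation_budget([["w"] * 10, ["w"] * 20])
dummy_model = lambda q, d, prefix, n: ["i", "x", "y"][:n]
score = rouge_20(dummy_model, ["q"], ["d"], ref, budget)
print(budget, score)
```

Because the model is conditioned on most of one specific reference, the score depends less on which of the many valid answers the model would have chosen on its own.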
We use crowdworkers to conduct three assessments. First, evaluators rate the fluency of human and model generated answers on a 5-point Likert Scale, from “very poorly written” to “easily readable” (500 evaluations). Second, evaluators are given question-answer pairs and are asked if the answer is correct (500 evaluations). (We experimented with a variant where crowdworkers were allowed to select a third “I don’t know” option, but found it was used only around 8% of the time.) We also evaluated a smaller subset ourselves while additionally looking at the support documents (100 evaluations) to assess answer accuracy. Lastly, crowdworkers are given the question and answers from two models and asked to decide which answer they prefer while considering readability and accuracy (1000 evaluations). Each crowdworker assessment is made by 3 different evaluators. The same questions are used for all models and must be at least 5 words long.
5.1 Extractive and Retrieval Models
Retrieval baseline and oracle.
We report ROUGE for a retrieval system that returns the answer of the closest question in the training set. Specifically, we perform a nearest neighbor search (Johnson et al., 2017) over the average word embeddings of the question using fasttext Bojanowski et al. (2017). We also compute an approximate oracle score for extractive systems by using the reference answer to select similar sentences from the support document to maximize ROUGE. Computing ROUGE between the reference and all sets of sentences from the source is intractable. Instead, we perform a beam search that adds sentences maximizing TFIDF with respect to the answer. The final beam is re-ranked using ROUGE with respect to the reference answer. We run this algorithm on our support document and on the full set of web sources for each validation and test question, selecting up to 10 sentences with a beam of size 10.
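The approximate oracle can be pictured with the sketch below. Several details are assumptions: unigram-overlap F1 stands in for both the TFIDF step score and the final ROUGE re-ranking, and hypotheses of every length are pooled before re-ranking so that "up to" the maximum number of sentences can be returned. The `oracle_select` name and its parameters are hypothetical.

```python
from collections import Counter


def overlap_f1(hyp, ref):
    """Unigram-overlap F1, used here as a stand-in for TFIDF and ROUGE scoring."""
    h, r = Counter(hyp), Counter(ref)
    o = sum((h & r).values())
    if o == 0:
        return 0.0
    p, rec = o / sum(h.values()), o / sum(r.values())
    return 2 * p * rec / (p + rec)


def oracle_select(sentences, answer, max_sentences=10, beam_size=10,
                  step_score=overlap_f1, final_score=overlap_f1):
    """Beam search over sentence subsets; returns selected indices in document order."""
    def tokens_of(hyp):
        return [t for j in hyp for t in sentences[j]]

    beam, pool = [()], []
    for _ in range(min(max_sentences, len(sentences))):
        candidates = {}
        for hyp in beam:
            for i in range(len(sentences)):
                if i not in hyp:
                    new = tuple(sorted(hyp + (i,)))
                    if new not in candidates:
                        candidates[new] = step_score(tokens_of(new), answer)
        beam = sorted(candidates, key=lambda h: -candidates[h])[:beam_size]
        pool.extend(beam)
    if not pool:
        return []
    # Re-rank every retained hypothesis with the final metric.
    return list(max(pool, key=lambda h: final_score(tokens_of(h), answer)))


sentences = [
    ["cats", "sleep"],
    ["blue", "light", "scatters"],
    ["pizza", "is", "tasty"],
    ["more", "than", "red"],
]
answer = ["blue", "light", "scatters", "more", "than", "red"]
print(oracle_select(sentences, answer))
```

In the toy input, the second and fourth sentences together reproduce the answer exactly, so that pair receives the highest re-ranking score.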
The first baseline we explore simply returns the sentences from the support document which have the highest TFIDF similarity with the question. We also evaluate models which score sentences from the support document based on the question and return the highest scoring sentences in their original order (the number is tuned on the validation set to maximize ROUGE). We train a model based on BidAF Seo et al. (2017). We create an extractive training set by finding the span of up to contiguous sentences in the support document which have the highest ROUGE with respect to the reference answer, and sub-sample other support document sentences so that the final training document is shorter than words. We then train a BidAF model to predict the extracted span in the sub-sampled support document based on the question. For test, we compute the span score for each individual sentence, and return the with the highest score as it performed best compared to returning 3 or 7 sentences.
5.2 Abstractive Models
Language and Seq2Seq models.
We train several models based on the Transformer architecture Vaswani et al. (2017), both in its language model and sequence-to-sequence (Seq2Seq) configurations. To investigate how much information from the document the model uses, we train a language model on the concatenation of Question, Support Document, and Answer (Q + D + A) as well as on the Question and Answer (Q + A). Similarly, one Seq2Seq configuration goes from Q to A, and the other from Q + D to A. In all cases, Q, D, and A are separated by special tokens.
Language models are trained to predict all tokens in the question, web source, and answer. However, the standard Seq2Seq model only receives training signal from predicting the answer which is much less than the language model gets. This can contribute to learning poor quality representations compared to language models. To address this, we train a multi-task Seq2Seq model: during training, we multi-task between several generation tasks, including language modeling of Q + D + A by the decoder and variations of source/target pairs (see Appendix A). We add a masked word prediction task Devlin et al. (2018) where 15% of tokens in the input are masked and must be recovered by the model in the correct order, and append a marker at the start of each sequence to indicate the task.
|Oracle support doc||-|
|Oracle web sources||-|
|LM Q + A|
|LM Q + D + A|
|Seq2Seq Q to A|
|Seq2Seq Q + D to A|
|Model||FILL-1 acc. N||FILL-1 acc. V||FILL-1 acc. A||ROUGE-20% R-1||ROUGE-20% R-2||ROUGE-20% R-L|
|LM Q + A||31.0||29.6||20.6||26.5||7.0||21.1|
|LM Q + D + A||30.9||28.9||19.9||26.3||7.8||21.3|
|S2S Q to A||21.7||23.0||15.5||33.6||11.5||29.5|
|S2S Q + D to A||27.6||26.3||19.4||32.7||10.7||28.6|
To reduce the vocabulary, we apply byte-pair encoding Sennrich et al. (2016) to generate 40K codes which are applied to all datasets. We model a vocabulary of 52,863 tokens for answer generation. We use the Transformer implementation of fairseq-py Gehring et al. (2017) and train with the big architecture following the details in Vaswani et al. (2017). Given our data length, we train with a large batch size by delaying gradient updates until a sufficient number of examples have been seen Ott et al. (2018).
We generate from abstractive models using beam search with beam 5. We disallow repeated trigrams to prevent repetition, a technique commonly used in multi-sentence summarization Paulus et al. (2017); Fan et al. (2018). For the full answer generation task, we tune a minimum and maximum length for generation on the valid set and apply these settings to the test set.
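The trigram constraint can be sketched independently of any particular decoder. In this minimal illustration, `repeats_trigram` and `filter_candidates` are hypothetical helper names; a real beam search would apply the same check to each hypothesis before expanding it.

```python
def repeats_trigram(prefix, next_token):
    """True if appending next_token would create a trigram already present in prefix."""
    if len(prefix) < 2:
        return False
    candidate = (prefix[-2], prefix[-1], next_token)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return candidate in seen


def filter_candidates(prefix, scored_tokens):
    """Drop (token, log_prob) candidates that would repeat a trigram."""
    return [(tok, lp) for tok, lp in scored_tokens if not repeats_trigram(prefix, tok)]


prefix = "the sky is blue because the sky".split()
candidates = [("is", -0.1), ("looks", -1.2)]
print(filter_candidates(prefix, candidates))
```

Here "is" is removed even though it has the higher log-probability, because "the sky is" already occurs in the prefix.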
6.1 Overview of Model Performance
Full answer ROUGE.
Table 3 shows that the nearest neighbor baseline performs similarly to simply returning the support document, which indicates that memorizing answers from the training set is insufficient. For extractive models, the oracle provides an approximate upper bound on ROUGE-1. The BidAF model is the strongest extractive system, better than using TFIDF between the question and the support document to select sentences. However, these approaches are limited by the support document, as an oracle computed on the full web sources achieves a much higher ROUGE.
Abstractive methods achieve higher ROUGE, likely because they can adapt to the domain shift between the web sources and the ELI5 subreddit. In general, Seq2Seq models perform better than language models and the various Seq2Seq settings do not show large ROUGE differences. Figure 3 shows an example of generation for the language model and the best Seq2Seq and extractive settings (see Appendix F for additional random examples).
Perplexity and fill-in tasks.
Tables 3 and 4 present metrics specific to sequential generation models: perplexity of the answer, accuracy of the model’s FILL-1 word prediction for Nouns, Verbs, and Adjectives, and ROUGE of the conditional generation of the last 20% of the answer words. The language model perplexity is much lower than that of the standard Seq2Seq setting; this is likely linked to the number of output tokens the system is required to predict at training time. The multi-task Seq2Seq experiment, in which the Seq2Seq decoder is trained to predict the question and the document in addition to the answer, can reach the same perplexity as the language model. ROUGE-20% shows a much starker contrast between language modeling and Seq2Seq, as well as between standard Seq2Seq and multi-task training. The latter achieves strong ROUGE-1 performance. However, both versions of the language model are still better at FILL-1. These results suggest that the Seq2Seq model is better than the language model in maintaining coherence and that Seq2Seq relies on information over many time steps.
Human answers are rated highest in terms of fluency (Figure 4, left). The extractive model outputs human-written text which is likely fluent but with the failure mode of concatenating unrelated sentences. The multi-task model performs similarly to the extractive model which indicates that abstractive methods can generate coherent answers. The language model and standard Seq2Seq trail behind.
To get a sense of the stability of our results, we analyzed the standard deviation of three independent fluency trials conducted on separate days and we find low variation (Appendix E, Figure 10). We also measure agreement between crowdworkers in selecting positive (scores 4 and 5), negative (1 and 2), or neutral (3) choices on the 5-point Likert scale, and find that 2 crowdworkers agree almost 100% of the time (Appendix E, Figure 10).
In answer accuracy (Figure 4, middle), there is a large gap between human performance and all models. The language model is almost never accurate, while the extractive model is slightly more so than the multi-task model. Crowdworkers assessing accuracy do not have the support document. We evaluate accuracy ourselves with the support document in Figure 4, right. Similar to crowdworkers, we find 40% of extractive answers to be accurate. We find only 19% of multi-task model answers are fully accurate; even if the model output answers the question, it can generate a sentence with an incorrect statement. In contrast, the extractive model copies sentences from human-written text. However, the multi-task model is better at generating relevant answers (84% relevancy compared to 68% for extractive), as the extractive model is constrained by the support document.
Figure 5 presents pairwise preference judgments of human annotators shown answers from two of the five systems. The reference answer is preferred over the output of all of our trained models in at least of cases, indicating there is substantial room for improvement. The multi-task abstractive setting comes next, closely followed by the extractive (multi-task is only preferred in of comparisons), then the standard Seq2Seq and finally the language model, considered worse than any other setting in at least of cases.
We use a two-tailed binomial test to test statistical significance of the pairwise judgments and it shows that all judgments are statistically significant at.
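A two-tailed binomial test of this kind can be sketched with the standard library alone. The counts in the example (86 preferences out of 100 comparisons) are hypothetical and chosen only for illustration; the null hypothesis is that both systems are equally likely to be preferred.

```python
from math import comb


def binomial_two_tailed(k, n, p=0.5):
    """Two-sided p-value: total probability of outcomes no more likely than observing k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


# Hypothetical counts: one system preferred in 86 of 100 pairwise comparisons.
p_value = binomial_two_tailed(86, 100)
print(p_value)
```

A preference split this lopsided yields a p-value far below conventional significance thresholds, whereas an even split such as 50 of 100 yields a p-value of 1.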
6.2 Quantitative and Qualitative Analysis
Discussion of the proposed metrics.
We present a number of metrics which provide insight into various model behaviors. We recommend future work to report full ROUGE and ROUGE-20%. Perplexity and FILL-1 focus on local prediction and are poor indicators of overall appropriateness for the full task. Full answer ROUGE discriminates reasonably well between models with the same general architecture, but cannot rate an abstractive system against an extractive one. The ROUGE-20% measure abstracts away some variability and focuses on coherence between the beginning and end of an answer. This metric correlates with human judgments of quality but can only be reported for sequential generation.
Analysis of extractive, LM and Seq2Seq models.
Language models perform better than Seq2Seq in terms of perplexity and FILL-1, while being significantly worse at ROUGE-20% and human evaluations. To investigate this, we visualize the attention mechanism at the start of answer generation in Figure 6. The attention of the language model is strongly focused on nearby context when generating the first word of the answer, whereas the multi-task Seq2Seq model attends more evenly to relevant information in the question and the document. This validates our assumption that the language model’s focus on local context is insufficient for high quality answers.
In Figure 7 (left), we further investigate how the relevance and quality of the support document extraction step affects the answers provided by the extractive and abstractive setting. The ROUGE score is displayed for data subsets, partitioned by percentile of word overlap of the answer with the support document (e.g. how many answer words appear). While both models perform better for documents with higher ROUGE overlap between support document and human answer, the abstractive setting is much better at compensating for when the support document has lower relevance.
Data size and initial selection.
There is a large difference between the extractive oracle ROUGE using our support document and the oracle on full web sources. This suggests that the initial selection of our support document severely limits access to relevant information. To assess the impact of support document size, we re-run the selection step for examples to extract passages instead of , and run the oracle on these new inputs. Figure 8 shows the TFIDF rank of the passages from which sentences are selected. While slightly more sentences are extracted from the higher ranking passages, less than come from the first , and most oracles have at least one sentence from the last . For a model to perform best, it would have to handle inputs tens of thousands of words long. In Table 3, we show an oracle computed on the full web sources has much higher ROUGE than an oracle computed on the support document.
We analyze the impact of data size on performance in Figure 7. We train the multi-task model on 25%, 50%, 75%, and all of the data to compare performance. ROUGE increases as a function of the data used, and even though ELI5 is one of the larger QA datasets (§3), this shows that collecting more still helps. While we only used one reference answer per question here, recall that over half of them have multiple answers, which could be leveraged to train better models.
Our task blends the inter-dependent challenges of retrieving information, reasoning, and writing long outputs. Studying each of these aspects in context is particularly important. For example, we show that the abstractive model’s ability to compensate for a (realistically) imperfect support document is essential to its relative success over extractive methods. The fluency gap between the reference and the extractive system in human evaluation also suggests that the latter may require sequential decision capabilities. This kind of decision making is necessary to address the dual challenges of reasoning over several supporting facts and generating long coherent outputs. We see our task’s need to combine complementary systems as critical to gaining insights into their individual behaviors.
We introduce the first large-scale long form question answering dataset of open-ended queries with explanatory multi-sentence answers. We show that abstractive models generate coherent answers and are competitive with extractive models in human evaluation. Proposed models are far from human performance, in part due to the inability to exploit the long full web text. We hope ELI5 will inspire future work in all aspects of long-form QA, from the information extraction problem of obtaining information from long, multi-document input to generating more coherent and accurate paragraph-length answers.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
- Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In ACL.
- Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. Quac: Question answering in context. In EMNLP.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. CoRR, abs/1704.05179.
- Fan et al. (2018) Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In ACL Workshop on Neural Machine Translation and Generation.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional Sequence to Sequence Learning. In Proc. of ICML.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus. CoRR, abs/1702.08734.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- Kocisky et al. (2018) Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. TACL.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: a package for automatic evaluation of summaries. In ACL Workshop on Text Summarization Branches Out.
- Liu et al. (2018) Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In ICLR.
- Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human generated machine reading comprehension dataset. CoRR.
- Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In WMT, pages 1–9. Association for Computational Linguistics.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In ACL.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In EMNLP.
- Reddy et al. (2018) Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. Coqa: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
- Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
- Tombros and Sanderson (1998) Anastasios Tombros and Mark Sanderson. 1998. Advantages of query biased summaries in information retrieval. In SIGIR.
- Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In ACL Workshop on Representation Learning for NLP.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Voorhees (2003) Ellen M. Voorhees. 2003. Overview of the TREC 2003 question answering track. In TREC.
- Weissenborn et al. (2017) Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
Appendix A Details of Multitask Training
The multi-task seq2seq model is trained on a variety of tasks. Each task is marked by a special token that tells the model which task it is performing. The training tasks are listed below as (source, target) pairs, where “+” denotes a concatenation of inputs separated by a special token.
(empty, question + document)
(empty, question + document + answer)
(question + document, answer)
(question, document + answer)
masked word prediction: 15% of source words are replaced by a “[MASK]” token and the corresponding tokens must be predicted as the target in the correct order
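The task list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the task token strings, the separator token, placing the task token at the start of the source, and masking an exact 15% of positions (rather than sampling each word independently) are all assumptions made for the example.

```python
import random

# Hypothetical token strings; the paper does not specify the actual ones.
SEP = "<sep>"
MASK = "[MASK]"


def join(*parts):
    """Concatenate inputs, separated by the (assumed) special separator token."""
    return f" {SEP} ".join(parts)


def build_pairs(question, document, answer):
    """Return the four (source, target) pairs, each source tagged with a task token."""
    tasks = {
        "<lm_qd>": ("", join(question, document)),
        "<lm_qda>": ("", join(question, document, answer)),
        "<qd_to_a>": (join(question, document), answer),
        "<q_to_da>": (question, join(document, answer)),
    }
    # Assumption: the task token is prepended to the source sequence.
    return [((tok + " " + src).strip(), tgt) for tok, (src, tgt) in tasks.items()]


def mask_source(words, rate=0.15, rng=None):
    """Replace about `rate` of the words with MASK; the target is the
    replaced words in their original order."""
    rng = rng or random.Random(0)
    n_mask = min(len(words), max(1, round(rate * len(words))))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    for i in positions:
        masked[i] = MASK
    target = [words[i] for i in positions]
    return masked, target
```

Sorting the sampled positions is what keeps the target words in source order, as the masked word prediction task requires.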
Appendix B Architectural Details
B.1 Extractive BidAF
The BidAF model is trained using the AllenNLP (https://allennlp.org/) implementation with the standard hyper-parameters specified in the bidaf.jsonnet file (https://github.com/allenai/allennlp/blob/master/training_config/bidaf.jsonnet). We only change the batch size, since a 16GB GPU can only fit one example per batch, and as a result the Adam learning rate has to be changed to . We provide the code to select the target span and sub-sample the input in our data, as well as to convert it to the SQuAD format required by the AllenNLP system.
B.2 Abstractive Models
Models are trained with the Adam optimizer with beta values , initial learning rate with 4000 warmup steps to learning rate 0.0001. We follow the inverse square root learning rate scheduler described in Vaswani et al. (2017). Models are trained with a label smoothing value of 0.1.
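As a rough illustration of an inverse square root schedule with warmup, the sketch below warms up linearly to the peak learning rate of 0.0001 over 4000 steps and then decays in proportion to the inverse square root of the step number. The function name is hypothetical, the linear warmup shape is an assumption, and `init_lr` is left as a required argument because the starting learning rate is not given in the text here.

```python
import math


def inverse_sqrt_lr(step, *, init_lr, peak_lr=1e-4, warmup_steps=4000):
    """Learning rate at a given update step.

    init_lr is the (unspecified) starting rate; peak_lr and warmup_steps
    follow the values stated in the text.
    """
    if step < warmup_steps:
        # Assumed linear warmup from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_steps / step)
```

With this form the rate peaks exactly at the end of warmup and halves each time the step count quadruples.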
Sequence to sequence models are trained with the following architecture from Vaswani et al. (2017): 6 encoder layers, 6 decoder layers, FFN dimension 4096, 16 attention heads, embedding dimension 1024. Gradient updating is delayed until 32 updates have been processed. Models are regularized with dropout 0.1 and attention dropout 0.1.
Language models are trained with the same parameters described for seq2seq above, with 6 decoder layers. We did not train with 12 decoder layers, as we found that the deeper Transformer model was harder to optimize and achieved worse results than a 6-layer language model.
For generation, models generate a minimum of 200 words and a maximum of 500 words.
Appendix C Comparison of Extractive and Abstractive Methods
Figure 13 shows a generated answer for an example whose source document is of poor quality, yet the abstractive answer still achieves a strong ROUGE score. In comparison, the extractive answer is heavily affected by the poor document quality and drifts off topic.
Appendix D Test/Valid Similarity with Train
Figure 9 shows the performance of the Multi-task Seq2Seq and LM on Question + Document + Answer as a function of the similarity between a validation question and a question in the training set. The similarity is determined by TFIDF. Answer generation quality changes very little between questions that are more similar to a training question and those that are less similar.
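One way to compute such a similarity is sketched below. The text only states that TFIDF is used, so the whitespace tokenization, the smoothed IDF formula, the cosine measure, and taking the maximum over training questions are all assumptions made for this example.

```python
import math
from collections import Counter


def fit_idf(train_questions):
    """Term counts per training question and a smoothed IDF table."""
    docs = [Counter(q.lower().split()) for q in train_questions]
    df = Counter(term for d in docs for term in d)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return docs, idf


def weigh(counts, idf):
    """TF-IDF weights; terms unseen in training are dropped."""
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def max_train_similarity(question, train_questions):
    """Similarity of a question to its closest training question."""
    docs, idf = fit_idf(train_questions)
    q = weigh(Counter(question.lower().split()), idf)
    return max(cosine(q, weigh(d, idf)) for d in docs)
```

Validation questions could then be bucketed by this score and ROUGE reported per bucket.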
Appendix E Variance in Human Evaluation Studies
We analyze the variation of our human evaluation study for answer generation fluency in Figure 10. We conduct 3 different trials on the same 100 randomly sampled question-answer pairs from the test set for the selected models. Each trial is conducted on a different day. Our results show that the standard deviation across the trials is small and not statistically significant.
Further, each answer is evaluated for fluency by 3 different crowdworkers. Figure 10 analyzes the agreement rate between crowdworkers, who rate each answer on a five-option scale. We count it as “agreement” if all workers are positive, all negative, or all neutral about the answer fluency. All three crowdworkers agree around 60% of the time for most models and almost 80% of the time for the language model. As the language model generation is significantly less fluent than that of the other models, most crowdworkers are in agreement. At least two of the annotators agree almost 100% of the time for all of our evaluated systems.
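The agreement statistics described above can be computed as in the following sketch. It assumes ratings are integers from 1 to 5, with 1 and 2 counted as negative, 3 as neutral, and 4 and 5 as positive; this mapping and the function names are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def polarity(score):
    """Collapse a 1-5 rating to -1 (negative), 0 (neutral), or +1 (positive)."""
    return (score > 3) - (score < 3)


def agreement_rates(ratings):
    """ratings: one tuple of three 1-5 scores per answer.

    Returns (fraction where all three workers agree,
             fraction where at least two workers agree).
    """
    all_three = two_plus = 0
    for triple in ratings:
        counts = Counter(polarity(s) for s in triple)
        top = max(counts.values())  # size of the largest agreeing group
        all_three += top == 3
        two_plus += top >= 2
    n = len(ratings)
    return all_three / n, two_plus / n
```

Collapsing the five options to three polarities before comparing means that, for instance, ratings of 4 and 5 count as agreeing.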
Appendix F Examples
To better understand the output of our models, we display example generations randomly sampled from the test set for the multi-task Seq2Seq model (Figure 14) and the Extractive BidAF model (Figure 15). We additionally show a set of poor generations for the multi-task Seq2Seq model (Figure 16) that display a representative set of problems for this abstractive model.