Humans have an extraordinary capability to become immersed in long narrative texts such as novels and plays, relive the stories, sympathize with characters, and understand the key messages that authors intend to convey. This set of skills defines the core of reading comprehension, a problem that remains unsolved by machines.
What makes story comprehension challenging is the fact that novels typically feature very long texts written in a variety of styles, and contain long-range dependencies over characters and plot events. Existing datasets such as NarrativeQA (Kočiský et al., 2018) consist of stories with a set of questions and answers as a proxy task for understanding. However, such quizzes provide a very weak form of supervision for an artificial agent, which is required to learn to extract relevant semantic information from an extremely large volume of text with only a few QA pairs. This is likely one of the reasons that existing methods for summarization (See et al., 2017; Paulus et al., 2017; Zeng et al., 2016) and question answering (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016) are often limited to texts with only a few sentences or paragraphs. Recent forays into analyzing longer texts (Chen et al., 2017; Joshi et al., 2017; Liu et al., 2018) focus on Wikipedia articles or search results, which are typically simpler in structure and fact-based, and thus do not require a high level of semantic abstraction to comprehend.
To address this, we introduce the Shmoop Corpus: a dataset of stories (books and plays) with summaries from Shmoop (https://www.shmoop.com/), a learning-guide resource. Rather than summarizing the entire story in a few paragraphs, as in Wikipedia plot synopses, these summaries are written for each chapter (see Fig. 1). Each summary compresses the relevant events in the chapter into a few short paragraphs and is written in a neutral style. Moreover, the paragraphs in the summaries have a loose chronological alignment with the paragraphs in the chapter. Just as with the human learners for whom these summaries are originally intended, we believe this alignment provides a strong supervisory signal for training machine comprehension models.
We showcase this by constructing a benchmark set of NLP tasks from the corpus: Cloze-form question answering and a simplified form of abstractive summarization with multiple choices. We first demonstrate how the chronological structure can help compute alignments. Then, we use this alignment to learn semantic representations of stories that perform well on the tasks, thus demonstrating that this corpus is a key step towards improving reading comprehension for stories.
| Dataset | # Docs | Avg. summary len. (words) | Avg. document len. (words) |
|---|---|---|---|
| CNN/Daily Mail (Hermann et al., 2015) | 300k | 56 | 781 |
| Children's Book Test (Hill et al., 2015)* | 700k | N/A | 465 |
| MovieQA (Tapaswi et al., 2016)** | 199 | 714 | 23,877 |
| Shmoop Corpus (Ours) | 7,234 | 460 | 3,579 |

*Corpus does not contain summaries. Size is based on the number of contexts, which are derived from 108 texts.
**Dataset based on movie scripts and plot summaries.
We describe the features of the Shmoop corpus.
Data collection. We first retrieved paired summaries and narrative texts from the Shmoop website and Project Gutenberg (https://www.gutenberg.org/), respectively. The stories were parsed manually, split into chapters (plays into scenes), and then matched to their summaries. To assess the chronological order between summaries and stories, we manually labeled a small validation set (5% of the corpus) with alignments. To do this, we split summaries into paragraphs based on their bullet-point structure, and stories based on line breaks (up to 250 words per paragraph). We then manually aligned the summary and story paragraphs to best match their content.
Statistics. The corpus consists of 231 works (145 novels, 62 plays and 24 short stories) with a total of 7,234 chapter and summary pairs. Table 1 compares our dataset to other narrative or summarization datasets. Our Shmoop corpus strikes a balance between short-form large-scale datasets such as the Children’s Book Test and long-form small-scale corpora like NarrativeQA. At the paragraph level, our dataset has 111k summary paragraphs with 30 words on average, and 436k story paragraphs with an average of 59 words each.
Chronological structure. The manually aligned validation set contains 360 chapters (from 15 stories) with a total of 13.5k aligned summary paragraphs. During alignment, we did not impose any constraints and allowed annotators to skip story paragraphs that did not fit any summary paragraph. Despite this, only 168 (1.2%) of the aligned summary paragraphs deviate from chronological order.
3 Exploiting the Chronological Structure
We take advantage of the chronological structure in a two-step process. First, we use it to compute alignments that demonstrate good performance on our validation set. Second, we use these alignments as supervision for our benchmark tasks and show improved performance. Let us define some notation. For any chapter, let $S = (s_1, \ldots, s_n)$ denote the summary, composed of a sequence of paragraphs, and $D = (d_1, \ldots, d_m)$ denote the paragraphs of the story document. An alignment $A$ denotes a relationship between $S$ and $D$, where $A \in \{0, 1\}^{n \times m}$, such that $A_{ij} = 1$ iff $s_i$ is aligned to $d_j$. Note that by aligned, we mean that the summary paragraph $s_i$ partially or wholly encapsulates the content conveyed in the story paragraph $d_j$.
3.1 Computing Alignments
Let $f(s_i, d_j)$ denote a scoring function indicating similarity between $s_i$ and $d_j$. We define the optimal alignment for a given $f$ as $A^* = \arg\max_A \sum_{i,j} A_{ij} f(s_i, d_j)$, subject to a set of constraints. Specifically, we impose chronological ordering with the following constraints: $\forall i \; \exists j: A_{ij} = 1$, and for $i' > i$, if $A_{ij} = 1$ and $A_{i'j'} = 1$ then $j' \geq j$. Similar to prior work on aligning subtitles with scripts (Everingham et al., 2006) or plot synopsis sentences with video shots (Tapaswi et al., 2015), we can compute alignments efficiently with dynamic programming.
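As an illustration of the dynamic-programming step, the following is a minimal sketch (our own simplification, assuming each summary paragraph is assigned a contiguous, in-order block of story paragraphs and that there are at least as many story paragraphs as summary paragraphs; all names are ours):

```python
def align(score, N, M):
    """Find a monotone alignment maximizing total similarity.

    score[i][j]: similarity between summary paragraph i and story
    paragraph j. Each summary paragraph is assigned a contiguous,
    in-order block of story paragraphs, which satisfies the
    chronological-ordering constraint. Returns a list `seg` where
    seg[i] is the list of story indices aligned to summary i.
    """
    NEG = float("-inf")
    # dp[i][j]: best total score aligning summary paragraphs 0..i-1
    # to story paragraphs 0..j-1; back[i][j] stores the split point.
    dp = [[NEG] * (M + 1) for _ in range(N + 1)]
    back = [[0] * (M + 1) for _ in range(N + 1)]
    dp[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(i, M + 1):
            # Summary paragraph i-1 takes the block of story
            # paragraphs k..j-1; `acc` accumulates its score.
            acc = 0.0
            for k in range(j - 1, i - 2, -1):
                acc += score[i - 1][k]
                if dp[i - 1][k] != NEG and dp[i - 1][k] + acc > dp[i][j]:
                    dp[i][j] = dp[i - 1][k] + acc
                    back[i][j] = k
    # Recover the segmentation by walking the back pointers.
    seg, j = [], M
    for i in range(N, 0, -1):
        k = back[i][j]
        seg.append(list(range(k, j)))
        j = k
    return seg[::-1]
```

The cubic loop is for clarity; prefix sums over each score row would reduce the inner accumulation to constant time.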
We analyze a variety of alignments with and without imposing the temporal ordering. See Fig. 2 for an illustration of each approach.
Random-$N$. Not considering the similarity score leads to a random alignment. Here, $N$ randomly selected consecutive story paragraphs $d_j$ are aligned to each summary paragraph $s_i$.
Diagonal-$N$. Imposing temporal order without considering similarity scores leads to a diagonal alignment. In particular, the $N$ story paragraphs nearest the proportional (diagonal) position in the story are aligned with each summary paragraph.
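A minimal sketch of this diagonal baseline (the exact windowing used in the paper is not specified here; the proportional-position heuristic and all names are our assumptions):

```python
def diagonal_alignment(num_summary, num_story, n=3):
    """For each summary paragraph i, align the n story paragraphs
    closest to the proportional ('diagonal') position in the story."""
    alignment = []
    for i in range(num_summary):
        # Proportional position of summary paragraph i in the story.
        center = int((i + 0.5) * num_story / num_summary)
        # Clamp the window so it stays within the story.
        start = max(0, min(center - n // 2, num_story - n))
        alignment.append(list(range(start, start + n)))
    return alignment
```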
Similarity scores. We assign $d_j$ to the summary paragraph $s_{i^*}$ such that $i^* = \arg\max_i f(s_i, d_j)$. We compare three different scoring functions $f$: (i) BLEU (Papineni et al., 2002), a classic metric used in translation; (ii) BERT (Devlin et al., 2018), where we compute cosine similarity between paragraph representations extracted with the pre-trained BERT model; and (iii) TFIDF, which compares paragraphs using words weighted by their frequencies.
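For concreteness, a pure-Python sketch of the TFIDF scoring function (the tokenization and the exact TF-IDF weighting are our simplifying assumptions, not necessarily those used in the experiments):

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Compute sparse TF-IDF vectors over a collection of paragraphs."""
    docs = [Counter(p.lower().split()) for p in paragraphs]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # document frequencies
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The score $f(s_i, d_j)$ is then the cosine similarity between the two paragraphs' TF-IDF vectors.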
Chronological similarity scores. The alignment is computed using dynamic programming, under the chronological ordering constraints, with each of the scoring functions above.
| Alignment Type | Uses $f$ | Temporal | Prec. | Recall | F1 | Cloze Acc. | Summ. Acc. |
|---|---|---|---|---|---|---|---|
| No Context (Random Choice) | N/A | N/A | N/A | N/A | N/A | 0.109 | 0.102 |
| No Alignment (All Paragraphs) | ✗ | ✗ | 0.127 | 1.000 | 0.206 | 0.149 | 0.148 |
3.2 Benchmark NLP Tasks
We show the benefits of having an alignment by constructing two benchmark NLP tasks from the corpus: Cloze-form question answering and a simplified form of abstractive summarization. The alignment determines the subset of the story document used for each task. We choose these tasks to most effectively illustrate the benefits of aligned stories and summaries for reading comprehension. We expect improved performance with better alignments, as the model is not required to search through a long text. Note that our key contribution is not a new task but rather a different approach that incorporates alignments to improve story understanding. Additionally, our dataset serves as a benchmark for unsupervised learning of alignments.
Cloze-form question answering. We construct this task following Hermann et al. (2015). A question is constructed by masking an entity in a summary paragraph $s_i$. Given the subset of story paragraphs from $D$ aligned to $s_i$, as determined by $A$, the task is to select the correct answer from a set of candidate answers: up to 10 other entities from the same story. We determine entities using the entity tagger from AllenNLP (Gardner et al., 2017) and use entity anonymization (Hermann et al., 2015) to ensure that the model does not learn information about specific entities. For the small percentage of summary paragraphs that do not contain entities, we mask a random token during training. For evaluation, we report accuracy only on questions with masked entities.
Our model is a modified version of the Attentive Reader (AR) from Hermann et al. (2015), adapted for multi-paragraph texts. We treat the (potentially multiple) story paragraphs aligned to the summary paragraph as a single document and compute its question-conditioned representation. This is calculated as an attention-weighted sum of the word vectors, where attention is computed with respect to the question representation. We then combine the individual paragraph representations into a single feature using another attention-weighted sum, again attended by the question representation. This essentially amounts to a hierarchical version of AR. More model details are provided in Appendix B.1.
Multi-choice abstractive summarization. To see the benefits of having alignments, we propose a simplified form of abstractive summarization as another task. Given the story document $D$ and the corresponding alignment $A$, the goal of this task is to complete a summary paragraph given its first $K$ tokens. In particular, we make this a multiple-choice problem by creating a set of 10 candidate sentences: 1 correct, and 9 incorrect ones obtained from the tokens of other summary paragraphs of the same chapter.
Our model is an LSTM decoder that uses a modified version of the Attentive Reader to attend to the aligned portion of the document at each time step. In particular, we sequentially predict words using the LSTM decoder (similar to other generative summarizers) and rank the candidate answers by computing the average log-likelihood of each candidate. The use of multiple-choice answers for summarization removes the ambiguity introduced by sentence-comparison metrics (e.g., BLEU, ROUGE). Nevertheless, note that our LSTM decoder is able to generate a summary paragraph when no candidates are provided. Additional details are in Appendix B.2.
3.3 Learning Alignments
We also experiment with learning the alignment simultaneously with a task of interest (e.g., Cloze QA). This has the potential to yield performance improvements for both the alignment and the task. As the alignment is a latent, non-differentiable variable, we follow a Concave-Convex Procedure (Yu and Joachims, 2009). The optimization alternates between computing the optimal alignment $A$ given the model parameters $\theta$ using dynamic programming, and learning $\theta$ given $A$ with a gradient-based optimizer. More details on the procedure are provided in Appendix C.
Table 2 summarizes experimental results on alignment and NLP tasks studied in this work.
Alignment performance is reported as precision, recall, and F1 score, with each ground-truth aligned pair considered as a sample. We see that even with our simple methods, it is possible to produce reasonably accurate alignments. Our best result, using TFIDF pairwise scores with a chronological constraint, attains an F1 score of 0.452, whereas random alignments reach 0.186 or below. Chronological-TFIDF tends to closely follow the ground truth and has high recall (see Fig. 2).
Interestingly, while models pre-trained on large-scale corpora, such as BERT, have been successful on many tasks, story understanding still poses a significant challenge. Using sentence representations from a pre-trained BERT model performs similarly to random scoring and far worse than word matching: an F1 score of 0.25 for Chronological-BERT, versus 0.264 for Chronological-Random and 0.452 for Chronological-TFIDF. We speculate that this is due to the complexity of stories and summaries (e.g., tracking long-range dependencies, high variance in linguistic style), and the generally limited availability of summary-story pairs in the wild.
Benchmark NLP tasks. We use accuracy as the metric for both the Cloze and summarization tasks. In general, using an alignment helps improve performance, and alignments that take advantage of the chronological structure perform better on both tasks. Without using any alignment, accuracy on the tasks is 0.149 and 0.148, respectively.
We categorize methods based on whether they have access to the similarity scoring function $f$ and whether they use temporal order (diagonal or DP). Among methods without access to $f$, the non-temporal method (Random-$N$) achieves accuracies of 0.255 and 0.144. In comparison, the method that uses temporal order (Diagonal-$N$) obtains higher accuracies of 0.356 and 0.149, respectively. Among methods with access to $f$, using temporal order also improves performance, though less pronouncedly. For example, Chronological-TFIDF obtains accuracies of 0.407 and 0.154, while TFIDF achieves 0.367 and 0.152.
This demonstrates that the chronological order present in the corpus is useful for natural language tasks, particularly when no prior information about the text is available. One reason for the small improvements on summarization is the difficulty of the task. On the other hand, Cloze-form QA becomes much more accessible by leveraging alignment information.
Simultaneously learning alignments. Jointly optimizing the alignment $A$ and the model parameters $\theta$ for the Cloze task produces alignments with an F1 score of 0.355. While this is more accurate than other methods such as Chronological-BERT, it falls short of the Chronological-TFIDF approach. Learning alignments with weak supervision remains an active area of research (Zhu et al., 2015; Raffel et al., 2017; Luo et al., 2017), and is an interesting task that our corpus facilitates.
We introduced a corpus of stories and loosely-aligned summaries, as well as a set of benchmark tasks for story-based reading comprehension. We showed that the corpus’s structural properties, in the form of temporal ordering, are key to learning effective representations for these tasks. This is the core value of our corpus, as it makes the challenge of learning on stories much more accessible. Beyond the tasks we showed here, we believe this corpus can be built upon to expand to other challenges such as question answering and learning alignment, pushing the envelope of story understanding in multiple domains. We make the corpus available at https://github.com/achaudhury/shmoop-corpus.
Acknowledgments. We thank Shmoop for creating an amazing learning resource and allowing us to use their summary data for research purposes. The project was supported by DARPA Explainable AI (XAI) and NSERC Cohesa. We thank our Upwork annotators for helping us in annotating the alignments between stories and their summaries.
We present additional details on the benchmark Cloze-form QA and summarization tasks (Appendix A) and their corresponding models (Appendix B). We also discuss the method we adopt for learning alignments simultaneously with the task (Appendix C) and present some examples of aligned summary and story paragraphs from our corpus (Appendix D).
Appendix A Task Construction Details
A.1 Cloze-Form Question Answering
20,000 Leagues Under the Sea
Question: [MASK] obviously wants to find his Giant Narwhal as well. Only Conseil seems uninterested.
Multiple choice answers:
a) Aronnax    f) Giant Narwhal
b) Nautilus   g) Higginson
c) Conseil    h) Moby Dick
d) Ned        i) Vanikoro
e) Nemo       j) Kraken

Oedipus the King
Question: [MASK] reenters and demands that anyone with information about the former king's murder speak up. He curses the murderer.
Multiple choice answers:
a) Creon      f) Polybus
b) Jocasta    g) Apollo
c) Oedipus    h) Sphinx
d) Teiresias  i) Corinth
e) Laius      j) Thebes
We create Cloze-form questions in the following manner. For each summary paragraph, candidate entities for masking are determined using the pre-trained AllenNLP entity tagger (Gardner et al., 2017), using the PERSON, ORGANIZATION, and LOCATION tags. Entities that do not occur in the original text are removed from the candidate set. If multiple entity candidates are found in a single summary paragraph, the entity that appears least frequently in the summary is chosen, thus encouraging diversity of answers. If no entities are present in the summary paragraph, a token is chosen at random to be masked during training, but is not included when reporting the final performance.
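The masking step can be sketched as follows (a simplified illustration: exact token-level matching stands in for the tagger's span output, and the helper name is ours):

```python
def make_cloze(summary_tokens, entities, mask="[MASK]"):
    """Build a Cloze question by masking the candidate entity that
    occurs least frequently in the summary (to encourage answer
    diversity). Returns (question_tokens, answer), or None if no
    candidate entity appears in the summary."""
    counts = {e: summary_tokens.count(e) for e in entities
              if e in summary_tokens}
    if not counts:
        return None  # caller masks a random token during training
    answer = min(counts, key=counts.get)
    question = [mask if t == answer else t for t in summary_tokens]
    return question, answer
```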
A.2 Multi-choice Abstractive Summarization
In this task, our goal is to complete a summary bullet point given the first $K$ words of the paragraph, the original story $D$, and an alignment $A$. Summaries that have fewer than $K$ tokens are not considered in this task (either as correct or as contrastive answers). Instead of generating a new sentence, we wish to complete the sentence by choosing one among 10 possible candidate sentences. The wrong candidates consist of other summary paragraphs from the same chapter (ignoring their first $K$ tokens).
In particular, our model consists of a standard RNN decoder (e.g., an LSTM); during training, we teacher-force the entire summary paragraph and train the parameters using a cross-entropy loss. At test time, we seed (teacher-force) the model with the first $K$ tokens of the desired summary paragraph. Then, we compute the average log-likelihood of all remaining tokens in each candidate to rank them.
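The candidate-ranking step can be sketched as follows (decoder omitted; we assume per-token log-probabilities have already been computed for each candidate):

```python
def rank_candidates(candidate_token_logprobs):
    """Rank candidate completions by the average per-token
    log-likelihood assigned by the decoder, and return the index of
    the highest-scoring candidate. `candidate_token_logprobs` is a
    list of lists of per-token log-probabilities (one list per
    candidate); averaging makes candidates of different lengths
    comparable."""
    scores = [sum(lp) / len(lp) for lp in candidate_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)
```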
Appendix B Model Details
B.1 Cloze-Form Question Answering
We first encode the question (the summary paragraph with a blank) and the story paragraphs with a word embedding and a Bi-LSTM (Eqs. 1, 2). We then use our modified Attentive Reader to generate a conditional document representation based on the encoded mask-token representation and the alignment $A$ (Eq. 3). Finally, an answer vector is computed by adding the document representation to the mask-token representation and applying a linear layer (Eq. 4). This answer vector is compared against candidate answers using cosine similarity, and the highest-scoring answer is selected.
Our modified Attentive Reader first constructs a conditional representation for each aligned story paragraph based on an attention-weighted sum of its token representations, where attention is computed as a function of the encoding of the mask token from the question and the token representations (Eqs. 5, 6). We then combine the paragraph representations into a single vector with another attention-weighted sum to produce the final AR output (Eqs. 7, 8), where the set of paragraphs attended over corresponds to the latent alignment $A$.
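The two-level attention-weighted sum can be sketched in NumPy as follows (a simplified dot-product attention stands in for the learned scoring functions in Eqs. 5-8; all names are ours):

```python
import numpy as np

def attend(query, vectors):
    """Attention-weighted sum of `vectors` (T x d) w.r.t. a query (d,)."""
    scores = vectors @ query
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ vectors

def hierarchical_attend(query, paragraphs):
    """Two-level attention: first pool the token vectors within each
    aligned story paragraph, then pool the resulting paragraph
    vectors, both conditioned on the question representation."""
    para_vecs = np.stack([attend(query, p) for p in paragraphs])
    return attend(query, para_vecs)
```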
Our model parameters comprise a word embedding layer, several linear layers, and the bi-directional LSTMs. The dimensionality of embeddings and LSTM hidden states is chosen to be 200. The model parameters are trained via a max-margin loss using the Adam optimizer.
B.2 Multi-choice Abstractive Summarization
We adapt the Attentive Reader from above to the generation task by computing a new context vector for each generated token rather than a single one. These context vectors are used to create an answer-prediction vector at each time step.
Unlike the model used for Cloze-form QA, we use a cross-entropy loss at each time step. We use a word embedding and LSTM hidden state dimension of 200, and train our model parameters using the Adam optimizer.
Appendix C Learning Alignment Details
We provide additional details for our approach towards learning the alignment jointly with the task.
Let $L(\theta, A)$ be the loss function for a given task, for a model with parameters $\theta$, depending on a latent alignment $A$. To jointly optimize both $\theta$ and $A$, we follow a modified version of the Concave-Convex Optimization Procedure (CCCP) outlined in Algorithm 1.
Notably, the step of solving for the optimal alignment can be computed efficiently with a dynamic programming algorithm due to the alignment constraints, as the computation can be made to have a linear dependency on the number of story paragraphs. For the sake of simplicity, we use contrastive sampling to estimate the max over all possible alignments during training. During validation, our focus is on determining the alignment, and we do not actually compute the loss.
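The alternation pattern of Algorithm 1 can be illustrated on a toy objective (entirely synthetic: a brute-force minimum over a small latent set stands in for the dynamic-programming step, and a numerical gradient for the gradient-based optimizer):

```python
def alternate_minimize(loss, latents, theta0, step=0.1, iters=50):
    """Alternate between (a) picking the latent value that minimizes
    the loss at the current parameters (the dynamic-programming step,
    here a brute-force min over a small set) and (b) a gradient step
    on the scalar parameter with the latent held fixed."""
    theta = theta0
    for _ in range(iters):
        a = min(latents, key=lambda z: loss(theta, z))      # latent step
        # Central-difference numerical gradient w.r.t. theta.
        grad = (loss(theta + 1e-5, a) - loss(theta - 1e-5, a)) / 2e-5
        theta -= step * grad                                # parameter step
    return theta, a
```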
Appendix D Additional Dataset Examples
We present several examples of summary paragraphs and their corresponding aligned story paragraphs. We include an example of a perfect one-to-one paragraph alignment from a novel (A Christmas Carol) in Table 5; a one-to-many alignment, common when story paragraphs consist of dialogue in a play (A Midsummer Night's Dream), in Table 6; and a many-to-one alignment for a long paragraph from a short story (The Masque of the Red Death) in Table 7.