The LAMBADA dataset [Paperno et al.2016] was designed by identifying word prediction tasks that require broad context. Each instance is drawn from the BookCorpus [Zhu et al.2015] and consists of a passage of several sentences where the task is to predict the last word of the last sentence. The instances are manually filtered to find cases that are guessable by humans when given the larger context but not when only given the last sentence. The expense of this manual filtering has limited the dataset to only about 10,000 instances which are viewed as development and test data. The training data is taken to be books in the corpus other than those from which the evaluation passages were extracted.
lambada:16 provide baseline results with popular language models and neural network architectures; all achieve zero percent accuracy. The best accuracy is 7.3% obtained by randomly choosing a capitalized word from the passage.
Our approach is based on the observation that in 83% of instances the answer appears in the context. We exploit this in two ways. First, we automatically construct a large training set of 1.8 million instances by simply selecting passages where the answer occurs in the context. Second, we treat the problem as a reading comprehension task similar to the CNN/Daily Mail datasets introduced by deepmind-reader:15, the Children’s Book Test (CBT) of cbt:15, and the Who-did-What dataset of onishi-16-full. We show that standard models for reading comprehension, trained on our automatically generated training set, improve the state of the art on the LAMBADA test set from 7.3% to 49.0%. This is in spite of the fact that these models fail on the 17% of instances in which the answer is not in the context.
We also perform a manual analysis of the LAMBADA task, provide an estimate of human performance, and categorize the instances in terms of the phenomena they test. We find that the comprehension models perform best on instances that require selecting a name from the context based on dialogue or discourse cues, but struggle when required to do coreference resolution or when external knowledge could help in choosing the answer.
2.1 Neural Readers
deepmind-reader:15 developed the CNN/Daily Mail comprehension tasks and introduced question answering models based on neural networks. Many others have been developed since. We refer to these models as “neural readers”. While a detailed survey is beyond our scope, we briefly describe the neural readers used in our experiments: the Stanford [Chen et al.2016], Attention Sum [Kadlec et al.2016], and Gated-Attention [Dhingra et al.2016] Readers. These neural readers use attention based on the question and passage to choose an answer from among the words in the passage. We use d for the context word sequence, q for the question (with a blank to be filled), for the candidate answer list, and for the vocabulary. We describe neural readers in terms of three components:
Embedding and Encoding: Each word in d and q is mapped into a
-dimensional vector via the embedding function, for all .111We overload the function to operate on sequences and denote the embedding of d and q as matrices and . The same embedding function is used for both d and q
. The embeddings are learned from random initialization; no pretrained word embeddings are used. The embedded context is processed by a bidirectional recurrent neural network (RNN) which computes hidden vectorsfor each position :
where and are RNN parameters, and each of and return a sequence of hidden vectors, one for each position in the input . The question is encoded into a single vector which is the concatenation of the final vectors of two RNNs:
The RNNs use either gated recurrent units[Cho et al.2014]Hochreiter and Schmidhuber1997].
Attention: The readers then compute attention weights on positions of using . In general, we define , where ranges over positions in . The function is an inner product in the Attention Sum Reader and a bilinear product in the Stanford Reader. The computed attentions are then passed through a
function to form a probability distribution. The Gated-Attention Reader uses a richer attention architecture[Dhingra et al.2016]; space does not permit a detailed description.
Output and Prediction: To output a prediction , the Stanford Reader computes the attention-weighted sum of the context vectors and then an inner product with each candidate answer:
where is the “output” embedding function. As the Stanford Reader was developed for the anonymized CNN/Daily Mail tasks, only a few entries in the output embedding function needed to be well-trained in their experiments. However, for LAMBADA, correct answers can range over the entirety of , making the output embedding function difficult to train. Therefore we also experiment with a modified version of the Stanford Reader that uses the same embedding function for both input and output words:
where is an additional parameter matrix used to match dimensions and model any additional needed transformation.
For the Attention Sum and Gated-Attention Readers the answer is computed by:
where is the set of positions where appears in context d.
2.2 Training Data Construction
Each LAMBADA instance is divided into a context (4.6 sentences on average) and a target sentence, and the last word of the target sentence is the target word to be predicted. The LAMBADA dataset consists of development (dev) and test (test) sets; lambada:16 also provide a control dataset (control), an unfiltered sample of instances from the BookCorpus.
We construct a new training dataset from the BookCorpus. We restrict it to instances that contain the target word in the context. This decision is natural given our use of neural readers that assume the answer is contained in the passage. We also ensure that the context has at least 50 words and contains 4 or 5 sentences and we require the target sentences to have more than 10 words.
Some neural readers require a candidate target word list to choose from. We list all words in the context as candidate answers, except for punctuation.222This list of punctuation symbols is at https://raw.githubusercontent.com/ZeweiChu/lambada-dataset/master/stopwords/shortlist-stopwords.txt Our new dataset contains 1,827,123 instances in total. We divide it into two parts, a training set (train) of 1,618,782 instances and a validation set (val) of 208,341 instances. These datasets can be found at the authors’ websites.
We use the Stanford Reader [Chen et al.2016], our modified Stanford Reader (Eq. 1), the Attention Sum (AS) Reader [Kadlec et al.2016], and the Gated-Attention (GA) Reader [Dhingra et al.2016]. We also add the simple features from hai-logical:16 to the AS and GA Readers. The features are concatenated to the word embeddings in the context. They include: whether the word appears in the target sentence, the frequency of the word in the context, the position of the word’s first occurrence in the context as a percentage of the context length, and whether the text surrounding the word matches the text surrounding the blank in the target sentence. For the last feature, we only consider matching the left word since the blank is always the last word in the target sentence.
All models are trained end to end without any warm start and without using pretrained embeddings. We train each reader on train
for a max of 10 epochs, stopping when accuracy ondev decreases two epochs in a row. We take the model from the epoch with max dev accuracy and evaluate it on test and control. val is not used.
We evaluate several other baseline systems inspired by those of lambada:16, but we focus on versions that restrict the choice of answers to non-stopwords in the context.333We use the stopword list from richardson2013mctest. We found this strategy to consistently improve performance even though it limits the maximum achievable accuracy.
We consider two -gram language model baselines. We use the SRILM toolkit [Stolcke2002] to estimate a 4-gram model with modified Kneser-Ney smoothing on the combination of train and val. One uses a cache size of 100 and the other does not use a cache. We use each model to score each non-stopword from the context. We also evaluate an LSTM language model. We train it on train, where the loss is cross entropy summed over all positions in each instance. The output vocabulary is the vocabulary of train, approximately 130k word types. At test time, we again limit the search to non-stopwords in the context.
We also test simple baselines that choose particular non-stopwords from the context, including a random one, the first in the context, the last in the context, and the most frequent in the context.
|Baselines [Paperno et al.2016]|
|Random in context||1.6||0||N/A|
|Random cap. in context||7.3||0||N/A|
|-gram + cache||0.1||19.1||N/A|
|Our context-restricted non-stopword baselines|
|Our context-restricted language model baselines|
|-gram + cache||11.8||2.2||15.6|
|Our neural reader results|
|Modified Stanford Reader||32.1||7.4||52.3|
|AS Reader + features||44.5||8.6||60.6|
|GA Reader + features||49.0||9.3||65.6|
Table 1 shows our results. We report accuracies on the entirety of test and control (“all”), as well as separately on the part of control where the target word is in the context (“context”). The first part of the table shows results from lambada:16. We then show our baselines that choose a word from the context. Choosing the most frequent yields a surprisingly high accuracy of 11.7%, which is better than all results from Paperno et al.
Our language models perform comparably, with the -gram + cache model doing best. By forcing language models to select a word from the context, the accuracy on test is much higher than the analogous models from Paperno et al., though accuracy suffers on control.
We then show results with the neural readers, showing that they give much higher accuracies on test than all other methods. The GA Reader with the simple additional features [Wang et al.2016] yields the highest accuracy, reaching 49.0%. We also measured the “top ” accuracy of this model, where we give the model credit if the correct answer is among the top ranked answers. On test, we reach 65.4% top-2 accuracy and 72.8% top-3.
The AS and GA Readers work much better than the Stanford Reader. One cause appears to be that the Stanford Reader learns distinct embeddings for input and answer words, as discussed above. Our modified Stanford Reader, which uses only a single set of word embeddings, improves by 10.4% absolute. Since the AS and GA Readers merely score words in the context, they do not learn separate answer word embeddings and therefore do not suffer from this effect.
We suspect the remaining accuracy difference between the Stanford and the other readers is due to the difference in the output function. The Stanford Reader was developed for the CNN and Daily Mail datasets, in which correct answers are anonymized entity identifiers which are reused across instances. Since the identifier embeddings are observed so frequently in the training data, they are frequently updated. In our setting, however, answers are words from a large vocabulary, so many of the word embeddings of correct answers may be undertrained. This could potentially be addressed by augmenting the word embeddings with identifiers to obtain some of the modeling benefits of anonymization [Wang et al.2016].
All context restricted models yield poor accuracies on the entirety of control. This is due to the fact that only 14.1% of control instances have the target word in the context, so this sets the upper bound that these models can achieve.
4.1 Manual Analysis
One annotator, a native English speaker, sampled 100 instances randomly from dev, hid the final word, and attempted to guess it from the context and target sentence. The annotator was correct in 86 cases. For the subset that contained the answer in the context, the annotator was correct in 79 of 87 cases. Even though two annotators were able to correctly answer all LAMBADA instances during dataset construction [Paperno et al.2016], our results give an estimate of how often a third would agree. The annotator did the same on 100 instances randomly sampled from control, guessing correctly in 36 cases. These results are reported in Table 1. The annotator was correct on 6 of the 12 control instances in which the answer was contained in the context.
|single name cue||9||89%||100%|
|simple speaker tracking||19||84%||100%|
|discourse inference rule||16||50%||88%|
We analyzed the 100 LAMBADA dev instances, tagging each with labels indicating the minimal kinds of understanding needed to answer it correctly.444The annotations are available from the authors’ websites. Each instance can have multiple labels. We briefly describe each label below:
single name cue: the answer is clearly a name according to contextual cues and only a single name is mentioned in the context.
simple speaker tracking: instance can be answered merely by tracking who is speaking without understanding what they are saying.
basic reference: answer is a reference to something mentioned in the context; simple understanding/context matching suffices.
discourse inference rule: answer can be found by applying a single discourse inference rule, such as the rule: “ left and went in search of ” .
semantic trigger: amorphous semantic information is needed to choose the answer, typically related to event sequences or dialogue turns, e.g., a customer says “Where is the ?” and a supplier responds “We got plenty of ”.
coreference: instance requires non-trivial coreference resolution to solve correctly, typically the resolution of anaphoric pronouns.
external knowledge: some particular external knowledge is needed to choose the answer.
Table 2 shows the breakdown of these labels across instances, as well as the accuracy on each label of the GA Reader with features.
The GA Reader performs well on instances involving shallower, more surface-level cues. In 9 cases, the answer is clearly a name based on contextual cues in the target sentence and there is only one name in the context; the reader answers all but one correctly. When only simple speaker tracking is needed (19 cases), the reader gets 84% correct.
The hardest instances are those that involve deeper understanding, like semantic links, coreference resolution, and external knowledge. While external knowledge is difficult to define, we chose this label when we were able to explicitly write down the knowledge that one would use when answering the instances, e.g., one instance requires knowing that “when something explodes, noise emanates from it”. These instances make up nearly a quarter of those we analyzed, making LAMBADA a good task for work in leveraging external knowledge for language understanding.
On control, while our readers outperform our other baselines, they are outperformed by the language modeling baselines from Paperno et al. This suggests that though we have improved the state of the art on LAMBADA by more than 40% absolute, we have not solved the general language modeling problem; there is no single model that performs well on both test and control. Our 36% estimate of human performance on control shows the difficulty of the general problem, and reveals a gap of 14% between the best language model and human accuracy.
A natural question to ask is whether applying neural readers is a good direction for this task, since they fail on the 17% of instances which do not have the target word in the context. Furthermore, this subset of LAMBADA may in fact display the most interesting and challenging phenomena. Some neural readers, like the Stanford Reader, can be easily used to predict target words that do not appear in the context, and the other readers can be modified to do so. Doing this will require a different selection of training data than that used above. However, we do wish to note that, in addition to the relative rarity of these instances in LAMBADA, we found them to be challenging for our annotator (who was correct on only 7 of the 13 in this subset).
We note that train has similar characteristics to the part of control that contains the answer in the context (the final column of Table 1). We find that the ranking of systems according to this column is similar to that in the test column. This suggests that our simple method of dataset creation could be used to create additional training or evaluation sets for challenging language modeling problems like LAMBADA, perhaps by combining it with baseline suppression [Onishi et al.2016].
We constructed a new training set for LAMBADA and used it to train neural readers to improve the state of the art from 7.3% to 49%. We also provided results with several other strong baselines and included a manual evaluation in an attempt to better understand the phenomena tested by the task. Our hope is that other researchers will seek models and training regimes that simultaneously perform well on both LAMBADA and control, with the goal of solving the general problem of language modeling.
We thank Denis Paperno for answering our questions about the LAMBADA dataset and we thank NVIDIA Corporation for donating GPUs used in this research.
- [Chen et al.2016] Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. of ACL.
- [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP.
- [Dhingra et al.2016] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint.
- [Hermann et al.2015] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of NIPS.
- [Hill et al.2016] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children’s books with explicit memory representations. In Proc. of ICLR.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8).
- [Kadlec et al.2016] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In Proc. of ACL.
- [Onishi et al.2016] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did What: A large-scale person-centered cloze dataset. In Proc. of EMNLP.
- [Paperno et al.2016] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. of ACL.
- [Richardson et al.2013] Matthew Richardson, Christopher JC Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP.
- [Stolcke2002] Andreas Stolcke. 2002. SRILM-an extensible language modeling toolkit. In Proc. of Interspeech.
- [Wang et al.2016] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. 2016. Emergent logical structure in vector representations of neural readers. arXiv preprint.
- [Zhu et al.2015] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV.