Implementation of "Learning Recurrent Span Representations for Extractive Question Answering" (Lee et al. 2016)
The reading comprehension task, which asks questions about a given evidence document, is a central problem in natural language understanding. Recent formulations of this task have typically focused on answer selection from a set of candidates pre-defined manually or through the use of an external NLP pipeline. However, Rajpurkar et al. (2016) recently released the SQuAD dataset in which the answers can be arbitrary strings from the supplied text. In this paper, we focus on this answer extraction task, presenting a novel model architecture that efficiently builds fixed length representations of all spans in the evidence document with a recurrent network. We show that scoring explicit span representations significantly improves performance over other approaches that factor the prediction into separate predictions about words or start and end markers. Our approach improves upon the best published results of Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s baseline by > 50%.
PyTorch implementation of the RaSoR paper "Learning Recurrent Span Representations for Extractive Question Answering" (Lee et al. 2016), with experiments on various neural components
A primary goal of natural language processing is to develop systems that can answer questions about the contents of documents. The reading comprehension task is of practical interest – we want computers to be able to read the world’s text and then answer our questions – and, since we believe it requires deep language understanding, it has also become a flagship task in NLP research.
A number of reading comprehension datasets have been developed that focus on answer selection from a small set of alternatives defined by annotators (Richardson et al., 2013) or existing NLP pipelines that cannot be trained end-to-end (Hill et al., 2016; Hermann et al., 2015). Subsequently, the models proposed for this task have tended to make use of the limited set of candidates, basing their predictions on mention-level attention weights (Hermann et al., 2015), or centering classifiers (Chen et al., 2016) or network memories (Hill et al., 2016) on candidate locations.
Recently, Rajpurkar et al. (2016) released the less restricted SQuAD dataset (http://stanford-qa.com) that does not place any constraints on the set of allowed answers, other than that they should be drawn from the evidence document. Rajpurkar et al. proposed a baseline system that chooses answers from the constituents identified by an existing syntactic parser. This allows them to prune the $O(N^2)$ answer candidates in each document of length $N$, but it also effectively renders some of the questions unanswerable.
Subsequent work by Wang & Jiang (2016) significantly improves upon this baseline by using an end-to-end neural network architecture to identify answer spans by labeling either individual words, or the start and end of the answer span. Neither of these methods makes independence assumptions about substructures, but both are susceptible to search errors due to greedy training and decoding.
In contrast, here we argue that it is beneficial to simplify the decoding procedure by enumerating all possible answer spans. By explicitly representing each answer span, our model can be globally normalized during training and decoded exactly during evaluation. A naive approach to building the spans of up to length $N$, where $N$ is the passage length, would require a network that is cubic in size with respect to the passage length, and such a network would be untrainable. To overcome this, we present a novel neural architecture called RaSoR that builds fixed-length span representations, reusing recurrent computations for shared substructures. We demonstrate that directly classifying each of the competing spans, and training with global normalization over all possible spans, leads to a significant increase in performance. In our experiments, we show an increase in performance over Wang & Jiang (2016) in terms of both exact match to a reference answer and predicted answer F1 with respect to the reference. On both of these metrics, we close the gap between Rajpurkar et al.’s baseline and the human-performance upper bound by more than 50%.
Extractive question answering systems take as input a question $q$ and a passage of text $p$ from which they predict a single answer span $a = \langle a_{\text{start}}, a_{\text{end}} \rangle$, represented as a pair of indices into $p$. Machine learned extractive question answering systems, such as the one presented here, learn a predictor function $f(q, p) \rightarrow a$ from a training dataset of $\langle q, p, a \rangle$ triples.
For the SQuAD dataset, the original paper from Rajpurkar et al. (2016) implemented a linear model with sparse features based on n-grams and part-of-speech tags present in the question and the candidate answer. Other than lexical features, they also used syntactic information in the form of dependency paths to extract more general features. They set a strong baseline for following work and also presented an in-depth analysis, showing that lexical and syntactic features contribute most strongly to their model’s performance. Subsequent work by Wang & Jiang (2016) uses an end-to-end neural network method that models the question and the passage with a Match-LSTM and extracts the answer span from the passage with pointer networks (Vinyals et al., 2015). This model resorts to greedy decoding and falls short in terms of performance compared to our model (see Section 5 for more detail). While we only compare to published baselines, there are other unpublished competitive systems on the SQuAD leaderboard, as listed in footnote 4.
A task that is closely related to extractive question answering is the Cloze task (Taylor, 1953), in which the goal is to predict a concealed span from a declarative sentence given a passage of supporting text. Recently, Hermann et al. (2015) presented a Cloze dataset in which the task is to predict the correct entity in an incomplete sentence given an abstractive summary of a news article. Hermann et al. also present various neural architectures to solve the problem. Although this dataset is large and varied in domain, recent analysis by Chen et al. (2016) shows that simple models can achieve close to the human upper bound. As noted by the authors of the SQuAD paper, the annotated answers in the SQuAD dataset are often spans that include non-entities and can be longer phrases, unlike the Cloze datasets, thus making the task more challenging.
Another, more traditional line of work has focused on extractive question answering on sentences, where the task is to extract a sentence from a document, given a question. Relevant datasets include datasets from the annual TREC evaluations (Voorhees & Tice, 2000) and WikiQA (Yang et al., 2015), where the latter dataset specifically focused on Wikipedia passages. There has been a line of interesting recent publications using neural architectures, focused on this variety of extractive question answering (Tymoshenko et al., 2016; Wang et al., 2016, inter alia). These methods model the question and a candidate answer sentence, but do not focus on possible candidate answer spans that may contain the answer to the given question. In this work, we focus on the more challenging problem of extracting the precise answer span.
We propose a model architecture called RaSoR (an abbreviation for Recurrent Span Representations, pronounced as razor), illustrated in Figure 1, that explicitly computes embedding representations for candidate answer spans. In most structured prediction problems (e.g. sequence labeling or parsing), the number of possible output structures is exponential in the input length, and computing representations for every candidate is prohibitively expensive. However, we exploit the simplicity of our task, where we can trivially and tractably enumerate all candidates. This facilitates an expressive model that computes joint representations of every answer span, that can be globally normalized during learning.
In order to compute these span representations, we must aggregate information from the passage and the question for every answer candidate. For the example in Figure 1, RaSoR computes an embedding for the candidate answer spans: fixed to, fixed to the, to the, etc. A naive approach for these aggregations would require a network that is cubic in size with respect to the passage length. Instead, our model reduces this to a quadratic size by reusing recurrent computations for shared substructures (i.e. common passage words) from different spans.
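To make the candidate set concrete, the following is a minimal sketch of span enumeration, assuming inclusive `(start, end)` token indices. The function name `enumerate_spans` and the optional `max_len` cap are illustrative choices, not names taken from the paper.

```python
def enumerate_spans(passage_len, max_len=None):
    """Return every (start, end) pair with start <= end, using inclusive indices.

    max_len optionally caps the number of words in a span.
    """
    spans = []
    for start in range(passage_len):
        # Exclusive upper limit for the end index of spans beginning at `start`.
        limit = passage_len if max_len is None else min(passage_len, start + max_len)
        for end in range(start, limit):
            spans.append((start, end))
    return spans


tokens = ["fixed", "to", "the"]
for start, end in enumerate_spans(len(tokens)):
    print(" ".join(tokens[start:end + 1]))
```

Without a length cap, a passage of N words yields N(N+1)/2 candidates, which is the quadratic growth the text refers to.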
Since the choice of answer span depends on the original question, we must incorporate this information into the computation of the span representation. We model this by augmenting the passage word embeddings with additional embedding representations of the question.
In this section, we motivate and describe the architecture for RaSoR in a top-down manner.
The goal of our extractive question answering system is to predict the single best answer span among all candidates from the passage $p$, denoted as $A(p)$. Therefore, we define a probability distribution over all possible answer spans given the question $q$ and passage $p$, and the predictor function finds the answer span with the maximum likelihood:
$$f(q, p) := \arg\max_{a \in A(p)} P(a \mid q, p)$$
One might be tempted to introduce independence assumptions that would enable cheaper decoding. For example, this distribution can be modeled as (1) a product of conditionally independent distributions (binary) for every word or (2) a product of conditionally independent distributions (over words) for the start and end indices of the answer span. However, we show in Section 5.2 that such independence assumptions hurt the accuracy of the model, and instead we only assume a fixed-length representation $h_a$ of each candidate span that is scored and normalized with a softmax layer (Span score and Softmax in Figure 1):
$$s_a = w_a \cdot \mathrm{FFNN}(h_a), \qquad a \in A(p)$$
$$P(a \mid q, p) = \frac{\exp(s_a)}{\sum_{a' \in A(p)} \exp(s_{a'})}$$
where $\mathrm{FFNN}$ denotes a fully connected feed-forward neural network that provides a non-linear mapping of its input embedding, and $w_a$ is a learned weight vector.
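As an illustration of global normalization and exact decoding, here is a minimal sketch that applies a softmax across the scores of all candidate spans and then picks the most probable one. The span scores are taken as given numbers; in the model they would come from the feed-forward scorer. The names `span_distribution` and `predict_span` are hypothetical.

```python
import math


def span_distribution(scores):
    """Softmax over a dict {span: score}, normalized across all candidate spans."""
    highest = max(scores.values())
    # Subtracting the maximum keeps exp() from overflowing.
    exps = {span: math.exp(score - highest) for span, score in scores.items()}
    total = sum(exps.values())
    return {span: value / total for span, value in exps.items()}


def predict_span(scores):
    """Exact decoding: return the candidate span with the highest probability."""
    probs = span_distribution(scores)
    return max(probs, key=probs.get)


# Made-up scores for three candidate spans, for illustration only.
example_scores = {(0, 0): 0.1, (0, 1): 2.0, (1, 1): -1.0}
print(predict_span(example_scores))
```

Because every candidate receives an explicit score, decoding is a plain maximum over the candidate set and involves no greedy search.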
The previously defined probability distribution depends on the answer span representations, $h_a$. When computing $h_a$, we assume access to representations of individual passage words that have been augmented with a representation of the question. We denote these question-focused passage word embeddings as $\{p^*_1, \dots, p^*_m\}$ and describe their creation in Section 3.3. In order to reuse computation for shared substructures, we use a bidirectional LSTM (Hochreiter & Schmidhuber, 1997) to encode the left and right context of every $p^*_i$ (Passage-level BiLSTM in Figure 1). This allows us to simply concatenate the bidirectional LSTM (BiLSTM) outputs at the endpoints of a span to jointly encode its inside and outside information (Span embedding in Figure 1):
$$\{p'_1, \dots, p'_m\} = \mathrm{BiLSTM}(\{p^*_1, \dots, p^*_m\})$$
$$h_a = [p'_{a_{\text{start}}}, p'_{a_{\text{end}}}]$$
where $\mathrm{BiLSTM}$ denotes a BiLSTM over its input embedding sequence, $p'_i$ is the concatenation of forward and backward outputs at time-step $i$, and $a_{\text{start}}$ and $a_{\text{end}}$ are the endpoint indices of span $a$. While the visualization in Figure 1 shows a single layer BiLSTM for simplicity, we use a multi-layer BiLSTM in our experiments. The concatenated output of each layer is used as input for the subsequent layer, allowing the upper layers to depend on the entire passage.
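The endpoint concatenation can be sketched as follows, assuming the passage has already been encoded into one vector per word. The `encoded` list stands in for BiLSTM outputs and holds placeholder numbers; no recurrent network is run here, and the function names are illustrative.

```python
def span_embedding(encoded, start, end):
    """Concatenate the encoder outputs at the two endpoints of a span (inclusive)."""
    return encoded[start] + encoded[end]  # list concatenation, not addition


def all_span_embeddings(encoded, max_len=30):
    """Build embeddings for every span of at most max_len words.

    The per-word vectors are computed once and shared by all spans, so the
    only per-span work is the concatenation itself.
    """
    embeddings = {}
    for start in range(len(encoded)):
        for end in range(start, min(len(encoded), start + max_len)):
            embeddings[(start, end)] = span_embedding(encoded, start, end)
    return embeddings


# Placeholder per-word vectors standing in for BiLSTM outputs.
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_span_embeddings(encoded)[(0, 2)])
```

The encoder runs once per passage and each span reuses its outputs, which is where the saving over encoding every span separately comes from.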
Computing the question-focused passage word embeddings requires integrating question information into the passage. The architecture for this integration is flexible and likely depends on the nature of the dataset. For the SQuAD dataset, we find that both passage-aligned and passage-independent question representations are effective at incorporating this contextual information, and experiments will show that their benefits are complementary. To incorporate these question representations, we simply concatenate them with the passage word embeddings (Question-focused passage word embedding in Figure 1).
We use fixed pretrained embeddings to represent question and passage words. Therefore, in the following discussion, the notation for the words is interchangeable with their embedding representations.
The first component simply looks up the pretrained word embedding for the passage word, $p_i$.
In this dataset, the question-passage pairs often contain large lexical overlap or similarity near the correct answer span. To encourage the model to exploit these similarities, we include a fixed-length representation of the question based on soft-alignments with the passage word. The alignments are computed via neural attention (Bahdanau et al., 2014), and we use the variant proposed by Parikh et al. (2016), where attention scores are dot products between non-linear mappings of word embeddings.
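A minimal sketch of this soft alignment is given below, assuming small plain-Python vectors. The `mapping` argument stands in for the learned non-linear mapping of word embeddings; its identity default is an assumption made only to keep the example self-contained, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(values):
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def passage_aligned(passage, question, mapping=lambda vec: vec):
    """For each passage word, return an attention-weighted sum of question embeddings.

    Attention scores are dot products between mapped passage and question vectors,
    normalized over the question words.
    """
    mapped_question = [mapping(q) for q in question]
    dim = len(question[0])
    aligned = []
    for p in passage:
        mapped_p = mapping(p)
        weights = softmax([dot(mapped_p, mq) for mq in mapped_question])
        aligned.append([sum(w * q[d] for w, q in zip(weights, question))
                        for d in range(dim)])
    return aligned


# One passage word that matches the first of two question words.
print(passage_aligned([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Each passage word receives its own question summary, so passage positions that are lexically close to the question end up with representations dominated by the matching question words.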
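We also include a representation of the question that does not depend on the passage and is shared for all passage words.
Similar to the previous question representation, an attention score is computed via a dot-product, except the question word is compared to a universal learned embedding rather than any particular passage word. Additionally, we incorporate contextual information with a BiLSTM before aggregating the outputs using this attention mechanism.
The goal is to generate a coarse-grained summary of the question that depends on word order. Formally, the passage-independent question representation $q^{\text{indep}}$ is computed as follows:
$$\{q'_1, \dots, q'_n\} = \mathrm{BiLSTM}(q)$$
$$s_j = w_q \cdot \mathrm{FFNN}(q'_j)$$
$$a_j = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}$$
$$q^{\text{indep}} = \sum_{j=1}^{n} a_j \, q'_j$$
where $w_q$ is the universal learned embedding and $q'_j$ is the BiLSTM output for the $j$-th question word.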
This representation is a bidirectional generalization of the question representation recently proposed by Li et al. (2016) for a different question-answering task.
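Given the above three components, the complete question-focused passage word embedding for $p_i$ is their concatenation: $p^*_i = [p_i, q_i^{\text{align}}, q^{\text{indep}}]$, where $q_i^{\text{align}}$ and $q^{\text{indep}}$ denote the passage-aligned and passage-independent question representations, respectively.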
Given the above model specification, learning is straightforward. We simply maximize the log-likelihood of the correct answer candidates and backpropagate the errors end-to-end.
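The training objective can be sketched as the negative log-likelihood of the gold span under a softmax over all candidates. This is a minimal illustration with hand-written scores and a single gold span; gradient computation is left out, and the function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of the sum of exponentials."""
    highest = max(values)
    return highest + math.log(sum(math.exp(v - highest) for v in values))


def span_nll(scores, gold_span):
    """Negative log-likelihood of gold_span under a softmax over all candidate spans."""
    return log_sum_exp(list(scores.values())) - scores[gold_span]


# Made-up scores: raising the gold span's score lowers the loss.
scores = {(0, 0): 0.0, (0, 1): 1.5, (1, 1): -0.5}
print(span_nll(scores, (0, 1)))
```

Minimizing this quantity is equivalent to maximizing the log-likelihood of the correct span, and the normalizer ranges over every candidate span in the passage.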
We represent each of the words in the question and document using 300-dimensional GloVe embeddings (Pennington et al., 2014). All out-of-vocabulary (OOV) words are projected onto one of a fixed set of randomly initialized embeddings. We couple the input and forget gates in our LSTMs, as described in Greff et al. (2016), and we use a single dropout mask to apply dropout across all LSTM time-steps as proposed by Gal & Ghahramani (2016). Hidden layers in the feed-forward neural networks use rectified linear units (Nair & Hinton, 2010). Answer candidates are limited to spans with at most 30 words.
To choose the final model configuration, we ran grid searches over: the dimensionality of the LSTM hidden states; the width and depth of the feed-forward neural networks; dropout for the LSTMs; the number of stacked LSTM layers; and the decay multiplier with which we multiply the learning rate at a fixed interval of steps. The best model uses two-layer BiLSTMs for the span encoder and the passage-independent question representation, with the same dropout rate applied throughout and a stepwise learning rate decay.
We train on the (question, passage, answer span) triples in the SQuAD training set and report results on the examples in the SQuAD development and test sets.
All results are calculated using the official SQuAD evaluation script, which reports exact answer match and F1 overlap of the unigrams between the predicted answer and the closest labeled answer from the reference answers given in the SQuAD development set.
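The two metrics can be illustrated with the simplified sketch below. It is not the official evaluation script; the normalization steps (lowercasing, stripping punctuation and English articles) are assumptions about what such a script typically does, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    return float(any(normalize(prediction) == normalize(r) for r in references))


def unigram_f1(prediction, references):
    """Unigram-overlap F1 against the closest (best-scoring) reference."""
    def f1_single(reference):
        pred_tokens = normalize(prediction).split()
        ref_tokens = normalize(reference).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(r) for r in references)


print(exact_match("The Egyptian", ["Egyptian"]))
print(unigram_f1("Egyptian forces", ["Egyptian"]))
```

Taking the maximum over references means a prediction is credited against whichever labeled answer it matches best.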
Our model with recurrent span representations (RaSoR) is compared to all previously published systems. Rajpurkar et al. (2016) published a logistic regression baseline as well as human performance on the SQuAD task. The logistic regression baseline uses the output of an existing syntactic parser both as a constraint on the set of allowed answer spans, and as a method of creating sparse features for an answer-centric scoring model. Despite not having access to any external representation of linguistic structure, RaSoR achieves an error reduction of more than 50% over this baseline, both in terms of exact match and F1, relative to the human performance upper bound.
Footnote 4: As of submission, other unpublished systems are shown on the SQuAD leaderboard, including Match-LSTM with Ans-Ptr (Boundary+Ensemble), Co-attention, r-net, Match-LSTM with Bi-Ans-Ptr (Boundary), Co-attention old, Dynamic Chunk Reader, Dynamic Chunk Ranker with Convolution layer, Attentive Chunker.
| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Logistic regression baseline | 39.8 | 51.0 | 40.4 | 51.0 |
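The phrase "error reduction relative to the human performance upper bound" can be read as the fraction of the baseline-to-human gap that a system closes. The sketch below works through that arithmetic; the input numbers are hypothetical placeholders chosen for easy arithmetic and are not results from the paper.

```python
def relative_error_reduction(baseline, system, upper_bound):
    """Fraction of the gap between baseline and upper bound closed by the system."""
    return (system - baseline) / (upper_bound - baseline)


# Hypothetical scores for illustration only (not reported results).
baseline_score = 40.0
system_score = 70.0
human_score = 80.0
print(relative_error_reduction(baseline_score, system_score, human_score))  # 0.75
```

Under this reading, a value above 0.5 means the system has closed more than half of the distance between the baseline and human performance.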
More closely related to RaSoR is the boundary model with Match-LSTMs and Pointer Networks by Wang & Jiang (2016). Their model similarly uses recurrent networks to learn embeddings of each passage word in the context of the question, and it can also capture interactions between endpoints, since the end index probability distribution is conditioned on the start index. However, both training and evaluation are greedy, making their system susceptible to search errors when decoding. In contrast, RaSoR can efficiently and explicitly model the quadratic number of possible answers, which leads to an error reduction over the best performing Match-LSTM model.
We investigate two main questions in the following ablations and comparisons. (1) How important are the two methods of representing the question described in Section 3.3? (2) What is the impact of learning a loss function that accurately reflects the span prediction task?
Table (a) shows the performance of RaSoR when either of the two question representations described in Section 3.3 is removed. The passage-aligned question representation is crucial, since lexically similar regions of the passage provide strong signal for relevant answer spans. If the question is only integrated through the inclusion of a passage-independent representation, performance drops drastically. The passage-independent question representation over the BiLSTM is less important, but it still accounts for part of the exact match and F1 performance. The input of both of these components is analyzed qualitatively in Section 6.
Given a fixed architecture that is capable of encoding the input question-passage pairs, there are many ways of setting up a learning objective to encourage the model to predict the correct span. In Table (b), we provide comparisons of some alternatives (learned end-to-end) given only the passage-level BiLSTM from RaSoR. In order to provide clean comparisons, we restrict the alternatives to objectives that are trained and evaluated with exact decoding.
The simplest alternative is to consider this task as binary classification for every word (Membership prediction in Table (b)). In this baseline, we optimize the logistic loss for binary labels indicating whether passage words belong to the correct answer span. At prediction time, a valid span can be recovered in linear time by finding the maximum contiguous sum of scores.
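Recovering a span from per-word membership scores is an instance of the maximum contiguous sum problem, which a single left-to-right pass can solve. The following is a minimal sketch, assuming one real-valued score per word with positive values favoring membership; the function name is hypothetical.

```python
def best_contiguous_span(scores):
    """Return ((start, end), total) for the non-empty contiguous run with the largest sum.

    Indices are inclusive. Runs in time linear in the number of words.
    """
    best_total, best_span = scores[0], (0, 0)
    current_total, current_start = scores[0], 0
    for i in range(1, len(scores)):
        if current_total < 0:
            # A negative running sum can only hurt, so start a new run here.
            current_total, current_start = scores[i], i
        else:
            current_total += scores[i]
        if current_total > best_total:
            best_total, best_span = current_total, (current_start, i)
    return best_span, best_total


print(best_contiguous_span([-1.0, 2.0, 3.0, -4.0, 1.0]))  # ((1, 2), 5.0)
```

Because the result is always a single contiguous run, the decoder cannot return a scattered set of words even though the per-word labels would allow it.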
Li et al. (2016) proposed a sequence-labeling scheme that is similar to the above baseline (BIO sequence prediction in Table (b)). We follow their proposed model and learn a conditional random field (CRF) layer after the passage-level BiLSTM to model transitions between the different labels. At prediction time, a valid span can be recovered in linear time using Viterbi decoding, with hard transition constraints to enforce a single contiguous output.
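Constrained Viterbi decoding for a single contiguous span can be sketched with four states: outside before the span, begin, inside, and outside after the span. This simplified illustration uses only per-word emission scores; the learned CRF transition scores are omitted (treated as zero), which is an assumption of the sketch and not the model described above.

```python
NEG_INF = float("-inf")
# States: 0 = O before the span, 1 = B, 2 = I, 3 = O after the span.
STATE_LABEL = ["O", "B", "I", "O"]
# Allowed predecessor states; these hard constraints force exactly one span.
PREDECESSORS = {0: [0], 1: [0], 2: [1, 2], 3: [1, 2, 3]}


def viterbi_single_span(emissions):
    """emissions: list of dicts with keys 'B', 'I', 'O'. Returns inclusive (start, end)."""
    n = len(emissions)
    score = [[NEG_INF] * 4 for _ in range(n)]
    back = [[None] * 4 for _ in range(n)]
    score[0][0] = emissions[0]["O"]
    score[0][1] = emissions[0]["B"]
    for t in range(1, n):
        for state in range(4):
            prev = max(PREDECESSORS[state], key=lambda r: score[t - 1][r])
            if score[t - 1][prev] > NEG_INF:
                score[t][state] = score[t - 1][prev] + emissions[t][STATE_LABEL[state]]
                back[t][state] = prev
    # A valid labeling must contain a B, so it cannot end in state 0.
    state = max((1, 2, 3), key=lambda s: score[n - 1][s])
    labels = []
    for t in range(n - 1, -1, -1):
        labels.append(STATE_LABEL[state])
        state = back[t][state]
    labels.reverse()
    start = labels.index("B")
    end = start
    while end + 1 < n and labels[end + 1] == "I":
        end += 1
    return (start, end)


emissions = [
    {"O": 1.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 2.0, "I": 0.0},
    {"O": 0.0, "B": 0.0, "I": 2.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
print(viterbi_single_span(emissions))  # (1, 2)
```

With a constant number of states, the dynamic program visits each word once, which gives the linear-time decoding the text mentions.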
We also consider a model that independently predicts the two endpoints of the answer span (Endpoints prediction in Table (b)). This model uses the softmax loss over passage words during learning. When decoding, we only need to enforce the constraint that the start index is no greater than the end index. Without the interactions between the endpoints, this can be computed in linear time. Note that this model has the same expressivity as RaSoR if the span-level FFNN were removed.
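Linear-time decoding for independently scored endpoints can be sketched as follows: sweep over end positions while remembering the best start score seen so far. The sketch assumes one start score and one end score per passage word, and the function name is hypothetical.

```python
def decode_endpoints(start_scores, end_scores):
    """Return the (start, end) pair maximizing start_scores[start] + end_scores[end]
    subject to start <= end, in a single pass over the passage.
    """
    best_start = 0
    best_pair = (0, 0)
    best_total = start_scores[0] + end_scores[0]
    for end in range(len(end_scores)):
        # Keep the highest-scoring start index among positions 0..end.
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        total = start_scores[best_start] + end_scores[end]
        if total > best_total:
            best_pair, best_total = (best_start, end), total
    return best_pair


# The unconstrained best start (index 1) lies after the best end (index 0),
# so the constraint changes the answer.
print(decode_endpoints([0.1, 2.0, 0.3], [3.0, 0.2, 1.0]))  # (0, 0)
```

The single pass works only because the score decomposes into a start term plus an end term; once a span-level network couples the two endpoints, every pair has to be scored explicitly.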
Lastly, we compare with a model that uses the same architecture as RaSoR but is trained with a binary logistic loss rather than a softmax loss over spans (Span prediction w/ logistic loss in Table (b)).
The trend in Table (b) shows that the model is better at leveraging the supervision as the learning objective more accurately reflects the fundamental task at hand: determining the best answer span.
First, we observe general improvements when using labels that closely align with the task. For example, the labels for membership prediction simply happen to provide single contiguous spans in the supervision. The model must consider far more possible answers than it needs to (the power set of all words). The same problem holds for BIO sequence prediction: the model must do additional work to learn the semantics of the BIO tags. On the other hand, in RaSoR, the semantics of an answer span is naturally encoded by the set of labels.
Second, we observe the importance of allowing interactions between the endpoints using the span-level FFNN. RaSoR outperforms the endpoint prediction model in exact match. The interaction between endpoints enables RaSoR to enforce consistency across its two substructures. While this does not provide improvements for predicting the correct region of the answer (captured by the F1 metric, which drops by 0.2), it is more likely to predict a clean answer span that matches human judgment exactly (captured by the exact-match metric).
Figure 2 shows how the performance of RaSoR and of the endpoint predictor introduced in Section 5.2 degrades as the lengths of their predictions increase. It is clear that explicitly modeling interactions between end markers is increasingly important as the span grows in length.
Figure 3 shows attention masks for both of RaSoR’s question representations. The passage-independent question representation pays most attention to the words that could attach to the answer in the passage (“brought”, “against”) or describe the answer category (“people”). Meanwhile, the passage-aligned question representation pays attention to similar words. The top predictions for both examples are all valid syntactic constituents, and they all have the correct semantic category. However, RaSoR assigns almost as much probability mass to its incorrect third prediction “British” as it does to the top scoring correct prediction “Egyptian”. This showcases a common failure case for RaSoR, where it can find an answer of the correct type close to a phrase that overlaps with the question – but it cannot accurately represent the semantic dependency on that phrase.
We have shown a novel approach for performing extractive question answering on the SQuAD dataset by explicitly representing and scoring answer span candidates. The core of our model relies on a recurrent network that enables shared computation for the shared substructure across span candidates. We explore different methods of encoding the passage and question, showing the benefits of including both passage-independent and passage-aligned question representations. While we show that this encoding method is beneficial for the task, this is orthogonal to the core contribution of efficiently computing span representations. In future work, we plan to explore alternate architectures that provide input to the recurrent span representations.
Gal & Ghahramani (2016). A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS, 2016.
Nair & Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, 2010.
Parikh et al. (2016). A decomposable attention model for natural language inference. In Proceedings of EMNLP, 2016.
Pennington et al. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.