Learning Recurrent Span Representations for Extractive Question Answering

11/04/2016 ∙ by Kenton Lee, et al.

The reading comprehension task, which asks questions about a given evidence document, is a central problem in natural language understanding. Recent formulations of this task have typically focused on answer selection from a set of candidates pre-defined manually or through the use of an external NLP pipeline. However, Rajpurkar et al. (2016) recently released the SQuAD dataset, in which the answers can be arbitrary strings from the supplied text. In this paper, we focus on this answer extraction task, presenting a novel model architecture that efficiently builds fixed-length representations of all spans in the evidence document with a recurrent network. We show that scoring explicit span representations significantly improves performance over other approaches that factor the prediction into separate predictions about words or start and end markers. Our approach improves upon the best published results of Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s baseline by > 50%.




1 Introduction

A primary goal of natural language processing is to develop systems that can answer questions about the contents of documents. The reading comprehension task is of practical interest – we want computers to be able to read the world’s text and then answer our questions – and, since we believe it requires deep language understanding, it has also become a flagship task in NLP research.

A number of reading comprehension datasets have been developed that focus on answer selection from a small set of alternatives defined by annotators (Richardson et al., 2013) or existing NLP pipelines that cannot be trained end-to-end (Hill et al., 2016; Hermann et al., 2015). Subsequently, the models proposed for this task have tended to make use of the limited set of candidates, basing their predictions on mention-level attention weights (Hermann et al., 2015), or centering classifiers (Chen et al., 2016), or network memories (Hill et al., 2016) on candidate locations.

Recently, Rajpurkar et al. (2016) released the less restricted SQuAD dataset (http://stanford-qa.com), which does not place any constraints on the set of allowed answers, other than that they should be drawn from the evidence document. Rajpurkar et al. proposed a baseline system that chooses answers from the constituents identified by an existing syntactic parser. This allows them to prune the $O(N^2)$ answer candidates in each document of length $N$, but it also effectively renders a portion of all questions unanswerable.

Subsequent work by Wang & Jiang (2016) significantly improves upon this baseline by using an end-to-end neural network architecture to identify answer spans by labeling either individual words, or the start and end of the answer span. Neither of these methods makes independence assumptions about substructures, but both are susceptible to search errors due to greedy training and decoding.

In contrast, here we argue that it is beneficial to simplify the decoding procedure by enumerating all possible answer spans. By explicitly representing each answer span, our model can be globally normalized during training and decoded exactly during evaluation. A naive approach to building the $O(N^2)$ spans of up to length $N$ would require a network that is cubic in size with respect to the passage length, and such a network would be untrainable. To overcome this, we present a novel neural architecture called RaSoR that builds fixed-length span representations, reusing recurrent computations for shared substructures. We demonstrate that directly classifying each of the competing spans, and training with global normalization over all possible spans, leads to a significant increase in performance. In our experiments, we show an increase in performance over Wang & Jiang (2016) in terms of both exact match to a reference answer and predicted answer F1 with respect to the reference. On both of these metrics, we close the gap between Rajpurkar et al.’s baseline and the human-performance upper-bound by more than 50%.

2 Extractive Question Answering

2.1 Task Definition

Extractive question answering systems take as input a question $q = \{q_1, \dots, q_n\}$ and a passage of text $p = \{p_1, \dots, p_m\}$ from which they predict a single answer span $a = \langle a_{start}, a_{end} \rangle$, represented as a pair of indices into $p$. Machine learned extractive question answering systems, such as the one presented here, learn a predictor function $f(q, p) \rightarrow a$ from a training dataset of $\langle q, p, a \rangle$ triples.

2.2 Related Work

For the SQuAD dataset, the original paper from Rajpurkar et al. (2016) implemented a linear model with sparse features based on n-grams and part-of-speech tags present in the question and the candidate answer. Other than lexical features, they also used syntactic information in the form of dependency paths to extract more general features. They set a strong baseline for following work and also presented an in-depth analysis, showing that lexical and syntactic features contribute most strongly to their model’s performance. Subsequent work by Wang & Jiang (2016) uses an end-to-end neural network method that uses a Match-LSTM to model the question and the passage, and uses pointer networks (Vinyals et al., 2015) to extract the answer span from the passage. This model resorts to greedy decoding and falls short in terms of performance compared to our model (see Section 5 for more detail). While we only compare to published baselines, there are other unpublished competitive systems on the SQuAD leaderboard, as listed in footnote 4.

A task that is closely related to extractive question answering is the Cloze task (Taylor, 1953), in which the goal is to predict a concealed span from a declarative sentence given a passage of supporting text. Recently, Hermann et al. (2015) presented a Cloze dataset in which the task is to predict the correct entity in an incomplete sentence given an abstractive summary of a news article. Hermann et al. also present various neural architectures to solve the problem. Although this dataset is large and varied in domain, recent analysis by Chen et al. (2016) shows that simple models can achieve close to the human upper bound. As noted by the authors of the SQuAD paper, the annotated answers in the SQuAD dataset are often spans that include non-entities and can be longer phrases, unlike the Cloze datasets, thus making the task more challenging.

Another, more traditional line of work has focused on extractive question answering on sentences, where the task is to extract a sentence from a document, given a question. Relevant datasets include datasets from the annual TREC evaluations (Voorhees & Tice, 2000) and WikiQA (Yang et al., 2015), where the latter dataset specifically focused on Wikipedia passages. There has been a line of interesting recent publications using neural architectures, focused on this variety of extractive question answering (Tymoshenko et al., 2016; Wang et al., 2016, inter alia). These methods model the question and a candidate answer sentence, but do not focus on possible candidate answer spans that may contain the answer to the given question. In this work, we focus on the more challenging problem of extracting the precise answer span.

3 Model

We propose a model architecture called RaSoR (an abbreviation for Recurrent Span Representations, pronounced as razor), illustrated in Figure 1, that explicitly computes embedding representations for candidate answer spans. In most structured prediction problems (e.g. sequence labeling or parsing), the number of possible output structures is exponential in the input length, and computing representations for every candidate is prohibitively expensive. However, we exploit the simplicity of our task, where we can trivially and tractably enumerate all candidates. This facilitates an expressive model that computes joint representations of every answer span, which can be globally normalized during learning.

In order to compute these span representations, we must aggregate information from the passage and the question for every answer candidate. For the example in Figure 1, RaSoR computes an embedding for the candidate answer spans: fixed to, fixed to the, to the, etc. A naive approach for these aggregations would require a network that is cubic in size with respect to the passage length. Instead, our model reduces this to a quadratic size by reusing recurrent computations for shared substructures (i.e. common passage words) from different spans.

Since the choice of answer span depends on the original question, we must incorporate this information into the computation of the span representation. We model this by augmenting the passage word embeddings with additional embedding representations of the question.

In this section, we motivate and describe the architecture for RaSoR in a top-down manner.

3.1 Scoring Answer Spans

The goal of our extractive question answering system is to predict the single best answer span among all candidates from the passage $p$, denoted as $A(p)$. Therefore, we define a probability distribution over all possible answer spans given the question $q$ and passage $p$, and the predictor function finds the answer span with the maximum likelihood:

$$f(q, p) := \underset{a \in A(p)}{\operatorname{argmax}} \; P(a \mid q, p)$$


One might be tempted to introduce independence assumptions that would enable cheaper decoding. For example, this distribution can be modeled as (1) a product of conditionally independent distributions (binary) for every word or (2) a product of conditionally independent distributions (over words) for the start and end indices of the answer span. However, we show in Section 5.2 that such independence assumptions hurt the accuracy of the model, and instead we only assume a fixed-length representation $h_a$ of each candidate span that is scored and normalized with a softmax layer (Span score and Softmax in Figure 1):

$$s_a = w_a \cdot \mathrm{FFNN}_a(h_a), \qquad a \in A(p)$$

$$P(a \mid q, p) = \frac{\exp(s_a)}{\sum_{a' \in A(p)} \exp(s_{a'})}$$

where $\mathrm{FFNN}$ denotes a fully connected feed-forward neural network that provides a non-linear mapping of its input embedding, and $w_a$ is a learned weight vector.
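The global normalization described above can be illustrated with a small sketch. It assumes the span scores have already been computed by some scoring network; the dictionary `toy_scores` and the function names are hypothetical and are not taken from the paper.

```python
import math


def span_softmax(scores):
    """Normalize raw span scores into one distribution over all candidate spans."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {span: math.exp(score - top) for span, score in scores.items()}
    total = sum(exps.values())
    return {span: value / total for span, value in exps.items()}


def predict(scores):
    """Exact decoding: return the span with the highest probability."""
    probs = span_softmax(scores)
    return max(probs, key=probs.get)


# Hypothetical scores for four candidate (start, end) spans.
toy_scores = {(0, 0): 0.1, (0, 1): 2.0, (1, 1): 0.5, (1, 2): 1.2}
probs = span_softmax(toy_scores)
best = predict(toy_scores)
```

Because every candidate span receives its own score, decoding reduces to a single argmax and needs no search procedure.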

3.2 RaSoR: Recurrent Span Representation

The previously defined probability distribution depends on the answer span representations, $h_a$. When computing $h_a$, we assume access to representations of individual passage words that have been augmented with a representation of the question. We denote these question-focused passage word embeddings as $\{p^*_1, \dots, p^*_m\}$ and describe their creation in Section 3.3. In order to reuse computation for shared substructures, we use a bidirectional LSTM (Hochreiter & Schmidhuber, 1997) to encode the left and right context of every $p^*_i$ (Passage-level BiLSTM in Figure 1). This allows us to simply concatenate the bidirectional LSTM (BiLSTM) outputs at the endpoints of a span to jointly encode its inside and outside information (Span embedding in Figure 1):

$$\{p^{*\prime}_1, \dots, p^{*\prime}_m\} = \mathrm{BiLSTM}(\{p^*_1, \dots, p^*_m\})$$

$$h_a = [p^{*\prime}_{a_{start}}, p^{*\prime}_{a_{end}}], \qquad \langle a_{start}, a_{end} \rangle \in A(p)$$

where $\mathrm{BiLSTM}$ denotes a BiLSTM over its input embedding sequence and $p^{*\prime}_i$ is the concatenation of forward and backward outputs at time-step $i$. While the visualization in Figure 1 shows a single layer BiLSTM for simplicity, we use a multi-layer BiLSTM in our experiments. The concatenated output of each layer is used as input for the subsequent layer, allowing the upper layers to depend on the entire passage.
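The endpoint concatenation can be sketched as follows. This is only an illustration: `states` stands in for the per-word BiLSTM outputs (here plain Python lists with made-up values), and the function names and the `max_len` parameter are assumptions rather than the authors' code.

```python
def enumerate_spans(num_words, max_len):
    """All (start, end) pairs with start <= end and at most max_len words."""
    return [
        (start, end)
        for start in range(num_words)
        for end in range(start, min(start + max_len, num_words))
    ]


def span_embeddings(states, max_len):
    """Represent each span by concatenating the states at its two endpoints."""
    return {
        (start, end): states[start] + states[end]  # list concatenation
        for start, end in enumerate_spans(len(states), max_len)
    }


# Stand-in for BiLSTM outputs of a three-word passage (2 dimensions per word).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
embeddings = span_embeddings(states, max_len=2)
```

The per-word states are computed once and shared by every span that touches them, so the only per-span work is the concatenation. The number of spans grows quadratically with passage length when `max_len` is unbounded and linearly when it is fixed.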

3.3 Question-focused Passage Word Embedding

Computing the question-focused passage word embeddings requires integrating question information into the passage. The architecture for this integration is flexible and likely depends on the nature of the dataset. For the SQuAD dataset, we find that both passage-aligned and passage-independent question representations are effective at incorporating this contextual information, and experiments will show that their benefits are complementary. To incorporate these question representations, we simply concatenate them with the passage word embeddings (Question-focused passage word embedding in Figure 1).

We use fixed pretrained embeddings to represent question and passage words. Therefore, in the following discussion, the notation for words is interchangeable with that of their embedding representations.

Question-independent passage word embedding

The first component simply looks up the pretrained word embedding for the passage word, $p_i$.

Passage-aligned question representation

In this dataset, the question-passage pairs often contain large lexical overlap or similarity near the correct answer span. To encourage the model to exploit these similarities, we include a fixed-length representation of the question based on soft-alignments with the passage word. The alignments are computed via neural attention (Bahdanau et al., 2014), and we use the variant proposed by Parikh et al. (2016), where attention scores are dot products between non-linear mappings of word embeddings:

$$s_{ij} = \mathrm{FFNN}(p_i) \cdot \mathrm{FFNN}(q_j), \qquad 1 \le j \le n$$

$$a_{ij} = \frac{\exp(s_{ij})}{\sum_{k=1}^{n} \exp(s_{ik})}, \qquad q^{align}_i = \sum_{j=1}^{n} a_{ij} \, q_j$$
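A minimal sketch of this soft alignment for a single passage word is given below. It assumes toy two-dimensional embeddings and, for brevity, uses the identity function in place of the learned non-linear mapping; the name `aligned_question` and its `mapping` parameter are hypothetical.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def aligned_question(passage_word, question_words, mapping=lambda vec: vec):
    """Attention-weighted sum of question embeddings for one passage word.

    `mapping` stands in for the learned non-linear mapping (identity here).
    """
    scores = [dot(mapping(passage_word), mapping(q)) for q in question_words]
    weights = softmax(scores)
    dim = len(question_words[0])
    summary = [
        sum(w * q[d] for w, q in zip(weights, question_words)) for d in range(dim)
    ]
    return summary, weights


question = [[1.0, 0.0], [0.0, 1.0]]
summary, weights = aligned_question([3.0, 0.0], question)
```

In this toy input the passage word points in the same direction as the first question word, so most of the attention weight falls on that word.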


Passage-independent question representation

We also include a representation of the question that does not depend on the passage and is shared for all passage words.

Similar to the previous question representation, an attention score is computed via a dot-product, except the question word is compared to a universal learned embedding rather than any particular passage word. Additionally, we incorporate contextual information with a BiLSTM before aggregating the outputs using this attention mechanism.

The goal is to generate a coarse-grained summary of the question that depends on word order. Formally, the passage-independent question representation is computed as follows:

$$\{q'_1, \dots, q'_n\} = \mathrm{BiLSTM}(q)$$

$$s_j = w_q \cdot \mathrm{FFNN}(q'_j), \qquad 1 \le j \le n$$

$$a_j = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}, \qquad q^{indep} = \sum_{j=1}^{n} a_j \, q'_j$$


This representation is a bidirectional generalization of the question representation recently proposed by Li et al. (2016) for a different question-answering task.

Given the above three components, the complete question-focused passage word embedding for $p_i$ is their concatenation: $p^*_i = [p_i, q^{align}_i, q^{indep}]$.
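The passage-independent summary and the final concatenation can be sketched in the same style. The inputs are assumptions: `question_states` stands in for the question BiLSTM outputs, `learned_vector` for the universal learned embedding, and the learned non-linear mapping is again omitted. None of the names come from the paper.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def independent_question(question_states, learned_vector):
    """Summarize the question by attending with one passage-independent vector."""
    weights = softmax([dot(learned_vector, h) for h in question_states])
    dim = len(question_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, question_states)) for d in range(dim)
    ]


def question_focused_embedding(passage_word, q_align, q_indep):
    """Concatenate the three components for one passage word."""
    return passage_word + q_align + q_indep


question_states = [[1.0, 0.0], [0.0, 1.0]]
q_indep = independent_question(question_states, learned_vector=[0.0, 0.0])
p_star = question_focused_embedding([0.2, 0.4], [0.9, 0.1], q_indep)
```

With an all-zero `learned_vector` the attention is uniform, so the summary is the plain average of the question states. The summary is computed once per question and reused for every passage word.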

3.4 Learning

Given the above model specification, learning is straightforward. We simply maximize the log-likelihood of the correct answer candidates and backpropagate the errors end-to-end.
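For a single training example, the objective can be written as the negative log-likelihood of the gold span under the softmax over all candidates. The sketch below assumes precomputed span scores and one gold span per example; `span_nll` and the toy scores are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of the sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def span_nll(scores, gold_span):
    """Negative log-likelihood of the gold span under a softmax over all spans."""
    return log_sum_exp(list(scores.values())) - scores[gold_span]


# Two equally scored candidates: the gold span has probability 1/2.
uniform_loss = span_nll({(0, 0): 1.0, (0, 1): 1.0}, gold_span=(0, 1))
# Raising the gold score lowers the loss.
better_loss = span_nll({(0, 0): 1.0, (0, 1): 3.0}, gold_span=(0, 1))
```

Minimizing this quantity raises the gold span's score relative to every competing span at once, which is what global normalization means here.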





[Figure 1 diagram. Example candidate spans: “fixed to”, “fixed to the”, “to the”, “to the turbine”. Layer labels: Span score; Hidden layer; Span embedding; Question-focused passage word embedding; Passage-independent question representation.]














Figure 1: A visualization of RaSoR, where the question is “What are the stators attached to?” and the passage is “…fixed to the turbine …”. The model constructs question-focused passage word embeddings by concatenating (1) the original passage word embedding, (2) a passage-aligned representation of the question, and (3) a passage-independent representation of the question shared across all passage words. We use a BiLSTM over these concatenated embeddings to efficiently recover embedding representations of all possible spans, which are then scored by the final layer of the model.

4 Experimental Setup

We represent each of the words in the question and document using 300-dimensional GloVe embeddings (Pennington et al., 2014). All out-of-vocabulary (OOV) words are projected onto one of a set of randomly initialized embeddings. We couple the input and forget gates in our LSTMs, as described in Greff et al. (2016), and we use a single dropout mask to apply dropout across all LSTM time-steps as proposed by Gal & Ghahramani (2016). Hidden layers in the feed-forward neural networks use rectified linear units (Nair & Hinton, 2010). Answer candidates are limited to spans with at most 30 words.

To choose the final model configuration, we ran grid searches over: the dimensionality of the LSTM hidden states; the width and depth of the feed-forward neural networks; dropout for the LSTMs; the number of stacked LSTM layers; and the decay multiplier applied to the learning rate at regular step intervals. The best model uses two-layer BiLSTMs for the span encoder and the passage-independent question representation, with the same dropout rate applied throughout.

All models are implemented using TensorFlow (www.tensorflow.org) and trained on the SQuAD training set using the ADAM (Kingma & Ba, 2015) optimizer, with mini-batches and asynchronous training threads on a single machine.

5 Results

We train on the (question, passage, answer span) triples in the SQuAD training set and report results on the examples in the SQuAD development and test sets.

All results are calculated using the official SQuAD evaluation script, which reports exact answer match and F1 overlap of the unigrams between the predicted answer and the closest labeled answer from the reference answers given in the SQuAD development set.
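The two metrics can be approximated with the simplified sketch below. It is not the official evaluation script: the official script applies additional answer normalization, while this version only lowercases and splits on whitespace, and the function names are assumptions.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the token sequences are identical after lowercasing, else 0.0."""
    return float(prediction.lower().split() == reference.lower().split())


def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """Score against the closest reference answer."""
    return max(metric(prediction, ref) for ref in references)


references = ["the turbine", "turbine"]
em = best_over_references(exact_match, "to the turbine", references)
f1 = best_over_references(unigram_f1, "to the turbine", references)
```

For this toy prediction, exact match is 0 because neither reference matches word for word, while F1 gives partial credit for the two shared words.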

5.1 Comparisons to other work

Our model with recurrent span representations (RaSoR) is compared to all previously published systems (see footnote 4). Rajpurkar et al. (2016) published a logistic regression baseline as well as human performance on the SQuAD task. The logistic regression baseline uses the output of an existing syntactic parser both as a constraint on the set of allowed answer spans, and as a method of creating sparse features for an answer-centric scoring model. Despite not having access to any external representation of linguistic structure, RaSoR achieves an error reduction of more than 50% over this baseline, both in terms of exact match and F1, relative to the human performance upper bound.

Footnote 4: As of submission, other unpublished systems are shown on the SQuAD leaderboard, including Match-LSTM with Ans-Ptr (Boundary+Ensemble), Co-attention, r-net, Match-LSTM with Bi-Ans-Ptr (Boundary), Co-attention old, Dynamic Chunk Reader, Dynamic Chunk Ranker with Convolution layer, and Attentive Chunker.

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Logistic regression baseline | 39.8 | 51.0 | 40.4 | 51.0 |
| Match-LSTM (Sequence) | 54.5 | 67.7 | 54.8 | 68.0 |
| Match-LSTM (Boundary) | 60.5 | 70.7 | 59.4 | 70.0 |
| RaSoR | 66.4 | 74.9 | 67.4 | 75.5 |
| Human | 81.4 | 91.0 | 82.3 | 91.2 |

Table 1: Exact match (EM) and span F1 on SQuAD.
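The error reduction relative to the human upper bound can be checked directly from the test columns of Table 1. The helper name `gap_closed` is hypothetical, and the formula is one reading of "error reduction relative to the human performance upper bound": the fraction of the baseline-to-human gap that the system closes.

```python
def gap_closed(baseline, system, human):
    """Fraction of the baseline-to-human gap that the system closes."""
    return (system - baseline) / (human - baseline)


# Test-set numbers copied from Table 1.
em_reduction = gap_closed(baseline=40.4, system=67.4, human=82.3)
f1_reduction = gap_closed(baseline=51.0, system=75.5, human=91.2)
print(f"EM: {em_reduction:.1%}, F1: {f1_reduction:.1%}")
```

Under this reading, both metrics come out above one half (about 64% for exact match and 61% for F1).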

More closely related to RaSoR is the boundary model with Match-LSTMs and Pointer Networks by Wang & Jiang (2016). Their model similarly uses recurrent networks to learn embeddings of each passage word in the context of the question, and it can also capture interactions between endpoints, since the end index probability distribution is conditioned on the start index. However, both training and evaluation are greedy, making their system susceptible to search errors when decoding. In contrast, RaSoR can efficiently and explicitly model the quadratic number of possible answers, which leads to an error reduction over the best performing Match-LSTM model.

5.2 Model Variations

We investigate two main questions in the following ablations and comparisons. (1) How important are the two methods of representing the question described in Section 3.3? (2) What is the impact of learning a loss function that accurately reflects the span prediction task?

Question representations

Table 2(a) shows the performance of RaSoR when either of the two question representations described in Section 3.3 is removed. The passage-aligned question representation is crucial, since lexically similar regions of the passage provide strong signal for relevant answer spans. If the question is only integrated through the inclusion of a passage-independent representation, performance drops drastically. The passage-independent question representation over the BiLSTM is less important, but it still accounts for over 3% in both exact match and F1. The input of both of these components is analyzed qualitatively in Section 6.

| Question representation | EM | F1 |
|---|---|---|
| Only passage-independent | 48.7 | 56.6 |
| Only passage-aligned | 63.1 | 71.3 |
| RaSoR | 66.4 | 74.9 |

(a) Ablation of question representations.

| Learning objective | EM | F1 |
|---|---|---|
| Membership prediction | 57.9 | 69.7 |
| BIO sequence prediction | 63.9 | 73.0 |
| Endpoints prediction | 65.3 | 75.1 |
| Span prediction w/ log loss | 65.2 | 73.6 |

(b) Comparisons for different learning objectives given the same passage-level BiLSTM.

Table 2: Results for variations of the model architecture presented in Section 3.

Learning objectives

Given a fixed architecture that is capable of encoding the input question-passage pairs, there are many ways of setting up a learning objective to encourage the model to predict the correct span. In Table 2(b), we provide comparisons of some alternatives (learned end-to-end) given only the passage-level BiLSTM from RaSoR. In order to provide clean comparisons, we restrict the alternatives to objectives that are trained and evaluated with exact decoding.

The simplest alternative is to consider this task as binary classification for every word (Membership prediction in Table 2(b)). In this baseline, we optimize the logistic loss for binary labels indicating whether passage words belong to the correct answer span. At prediction time, a valid span can be recovered in linear time by finding the maximum contiguous sum of scores.
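The linear-time recovery of a span from per-word scores is the classic maximum contiguous sum problem. The sketch below assumes each word has a real-valued membership score (positive meaning "likely inside the answer"); the function name and the example scores are hypothetical.

```python
def best_contiguous_span(scores):
    """Return ((start, end), total) for the non-empty run with the largest sum."""
    best_sum, best_span = scores[0], (0, 0)
    run_sum, run_start = scores[0], 0
    for i in range(1, len(scores)):
        if run_sum < 0:
            # A negative running sum can only hurt: start a new run here.
            run_sum, run_start = scores[i], i
        else:
            run_sum += scores[i]
        if run_sum > best_sum:
            best_sum, best_span = run_sum, (run_start, i)
    return best_span, best_sum


span, total = best_contiguous_span([-1.0, 2.0, -0.5, 3.0, -2.0])
```

In this example the best span covers words 1 through 3: including the slightly negative middle word is worthwhile because it connects two strongly positive words.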

Li et al. (2016) proposed a sequence-labeling scheme that is similar to the above baseline (BIO sequence prediction in Table 2(b)). We follow their proposed model and learn a conditional random field (CRF) layer after the passage-level BiLSTM to model transitions between the different labels. At prediction time, a valid span can be recovered in linear time using Viterbi decoding, with hard transition constraints to enforce a single contiguous output.
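A simplified sketch of constrained Viterbi decoding is shown below. It is an assumption-laden illustration: it uses only per-word emission scores and hard constraints (no learned CRF transition scores), and it splits the O label into hypothetical "before" and "after" states so that exactly one B can be emitted.

```python
NEG_INF = float("-inf")
STATES = ("O_before", "B", "I", "O_after")
# Hard constraints: exactly one B, optionally followed by I's, then only O's.
ALLOWED = {
    "O_before": ("O_before", "B"),
    "B": ("I", "O_after"),
    "I": ("I", "O_after"),
    "O_after": ("O_after",),
}
LABEL = {"O_before": "O", "B": "B", "I": "I", "O_after": "O"}


def viterbi_single_span(emissions):
    """emissions: one dict per word with scores for the labels "B", "I", "O"."""
    score = {state: NEG_INF for state in STATES}
    score["O_before"] = emissions[0]["O"]
    score["B"] = emissions[0]["B"]
    backpointers = []
    for emit in emissions[1:]:
        new_score = {state: NEG_INF for state in STATES}
        pointers = {}
        for prev in STATES:
            if score[prev] == NEG_INF:
                continue
            for nxt in ALLOWED[prev]:
                candidate = score[prev] + emit[LABEL[nxt]]
                if candidate > new_score[nxt]:
                    new_score[nxt] = candidate
                    pointers[nxt] = prev
        backpointers.append(pointers)
        score = new_score
    # A valid labeling must have emitted its B, so it cannot end in O_before.
    last = max(("B", "I", "O_after"), key=lambda state: score[state])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [LABEL[state] for state in path]


emissions = [
    {"O": 1.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 2.0, "I": 0.5},
    {"O": 0.2, "B": 1.5, "I": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
labels = viterbi_single_span(emissions)
```

Picking the best label for each word independently would yield two B tags for this input. The constraints force a single contiguous span instead.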

We also consider a model that independently predicts the two endpoints of the answer span (Endpoints prediction in Table 2(b)). This model uses the softmax loss over passage words during learning. When decoding, we only need to enforce the constraint that the start index is no greater than the end index. Without the interactions between the endpoints, this can be computed in linear time. Note that this model has the same expressivity as RaSoR if the span-level FFNN were removed.
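The linear-time decoding for independent endpoints can be sketched as a single left-to-right pass that remembers the best start seen so far. The function name and the example scores are hypothetical, and the maximum span length used in the experiments is ignored here for brevity.

```python
def best_endpoints(start_scores, end_scores):
    """Maximize start_scores[s] + end_scores[e] subject to s <= e."""
    best_total, best_span = float("-inf"), None
    best_start = 0
    for end in range(len(end_scores)):
        # Keep the highest-scoring start index among positions 0..end.
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        total = start_scores[best_start] + end_scores[end]
        if total > best_total:
            best_total, best_span = total, (best_start, end)
    return best_span, best_total


span, total = best_endpoints([0.1, 2.0, 0.3], [3.0, 0.2, 1.0])
```

Here the unconstrained argmaxes would be start 1 and end 0, which is not a valid span. The constrained pass returns span (0, 0) instead.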

Lastly, we compare with a model that uses the same architecture as RaSoR but is trained with a binary logistic loss rather than a softmax loss over spans (Span prediction w/ log loss in Table 2(b)).

The trend in Table 2(b) shows that the model is better at leveraging the supervision as the learning objective more accurately reflects the fundamental task at hand: determining the best answer span.

First, we observe general improvements when using labels that closely align with the task. For example, the labels for membership prediction simply happen to provide single contiguous spans in the supervision. The model must consider far more possible answers than it needs to (the power set of all words). The same problem holds for BIO sequence prediction: the model must do additional work to learn the semantics of the BIO tags. On the other hand, in RaSoR, the semantics of an answer span is naturally encoded by the set of labels.

Second, we observe the importance of allowing interactions between the endpoints using the span-level FFNN. RaSoR outperforms the endpoint prediction model by 1.1 in exact match. The interaction between endpoints enables RaSoR to enforce consistency across its two substructures. While this does not provide improvements for predicting the correct region of the answer (captured by the F1 metric, which drops by 0.2), it is more likely to predict a clean answer span that matches human judgment exactly (captured by the exact-match metric).

6 Analysis

Figure 2 shows how the performances of RaSoR and the endpoint predictor introduced in Section 5.2 degrade as the lengths of their predictions increase. It is clear that explicitly modeling interactions between end markers is increasingly important as the span grows in length.

Figure 2: F1 and Exact Match (EM) accuracy of RaSoR and the endpoint predictor baseline over different prediction lengths.
Figure 3: Attention masks from RaSoR. Top predictions for the first example are ‘Egyptians’, ‘Egyptians against the British’, ‘British’. Top predictions for the second are ‘unjust laws’, ‘what they deem to be unjust laws’, ‘laws’.

Figure 3 shows attention masks for both of RaSoR’s question representations. The passage-independent question representation pays most attention to the words that could attach to the answer in the passage (“brought”, “against”) or describe the answer category (“people”). Meanwhile, the passage-aligned question representation pays attention to similar words. The top predictions for both examples are all valid syntactic constituents, and they all have the correct semantic category. However, RaSoR assigns almost as much probability mass to its incorrect third prediction “British” as it does to the top scoring correct prediction “Egyptians”. This showcases a common failure case for RaSoR, where it can find an answer of the correct type close to a phrase that overlaps with the question, but it cannot accurately represent the semantic dependency on that phrase.

7 Conclusion

We have shown a novel approach for performing extractive question answering on the SQuAD dataset by explicitly representing and scoring answer span candidates. The core of our model relies on a recurrent network that enables shared computation for the shared substructure across span candidates. We explore different methods of encoding the passage and question, showing the benefits of including both passage-independent and passage-aligned question representations. While we show that this encoding method is beneficial for the task, this is orthogonal to the core contribution of efficiently computing span representations. In future work, we plan to explore alternate architectures that provide input to the recurrent span representations.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Chen et al. (2016) Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of ACL, 2016.
  • Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS, 2016.
  • Greff et al. (2016) Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, PP:1–11, 2016.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NIPS, 2015.
  • Hill et al. (2016) Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of ICLR, 2016.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long Short-term Memory. Neural computation, 9(8):1735–1780, 1997.
  • Kingma & Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of ICLR, 2015.
  • Li et al. (2016) Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. CoRR, abs/1607.06275, 2016.
  • Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, 2010.
  • Parikh et al. (2016) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of EMNLP, 2016.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
  • Richardson et al. (2013) Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.
  • Taylor (1953) Wilson Taylor. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30:415–433, 1953.
  • Tymoshenko et al. (2016) Kateryna Tymoshenko, Daniele Bonadiman, and Alessandro Moschitti. Convolutional neural networks vs. convolution kernels: Feature engineering for answer sentence reranking. In Proceedings of NAACL, 2016.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proceedings of NIPS, 2015.
  • Voorhees & Tice (2000) Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In Proceedings of SIGIR, 2000.
  • Wang et al. (2016) Bingning Wang, Kang Liu, and Jun Zhao. Inner attention based recurrent neural networks for answer selection. In Proceedings of ACL, 2016.
  • Wang & Jiang (2016) Shuohang Wang and Jing Jiang. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016.
  • Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP, 2015.