Contextualized Word Representations for Reading Comprehension

12/10/2017 ∙ by Shimi Salant, et al. ∙ Tel Aviv University

Reading a document and extracting an answer to a question about its content has attracted substantial attention recently, where most work has focused on the interaction between the question and the document. In this work we evaluate the importance of context when the question and the document are each read on their own. We take a standard neural architecture for the task of reading comprehension, and show that by providing rich contextualized word representations from a large language model, and allowing the model to choose between context dependent and context independent word representations, we can dramatically improve performance and reach state-of-the-art performance on the competitive SQuAD dataset.




1 Introduction

Reading comprehension (RC) is a high-level task in natural language understanding that requires reading a document and answering questions about its content. RC has attracted substantial attention over the last few years with the advent of large annotated datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Trischler et al., 2016; Nguyen et al., 2016; Joshi et al., 2017), computing resources, and neural network models and optimization procedures (Weston et al., 2015; Sukhbaatar et al., 2015; Kumar et al., 2015).

Reading comprehension models must invariably represent word tokens contextually, as a function of their encompassing sequence (document or question). The vast majority of RC systems encode contextualized representations of words in both the document and question as hidden states of bidirectional RNNs (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997; Cho et al., 2014), and focus model design and capacity around question-document interaction, carrying out calculations where information from both is available (Seo et al., 2016; Xiong et al., 2017b; Huang et al., 2017; Wang et al., 2017).

Analysis of current RC models has shown that models tend to react to simple word-matching between the question and document (Jia and Liang, 2017), as well as benefit from explicitly providing matching information in model inputs (Hu et al., 2017; Chen et al., 2017; Weissenborn et al., 2017). In this work, we hypothesize that the still-relatively-small size of RC datasets drives this behavior, which leads to models that make limited use of context when representing word tokens.

To illustrate this idea, we take a model that carries out only basic question-document interaction and prepend to it a module that produces token embeddings by explicitly gating between contextual and non-contextual representations (for both the document and question). This simple addition already places the model’s performance on par with recent work, and allows us to demonstrate the importance of context.

Motivated by these findings, we turn to a semi-supervised setting in which we leverage a language model, pre-trained on large amounts of data, as a sequence encoder which forcibly facilitates context utilization. We find that model performance substantially improves, reaching accuracy comparable to state-of-the-art on the competitive SQuAD dataset, showing that contextual word representations captured by the language model are beneficial for reading comprehension. (Our complete code base is publicly available.)

2 Contextualized Word Representations

Problem definition

We consider the task of extractive reading comprehension: given a paragraph of text $p = (p_1, \ldots, p_n)$ and a question $q = (q_1, \ldots, q_m)$, an answer span $(p_l, \ldots, p_r)$ is to be extracted, i.e., a pair of indices $1 \le l \le r \le n$ into $p$ is to be predicted.

When encoding a word token in its encompassing sequence (question or passage), we are interested in allowing extra computation over the sequence and evaluating the extent to which context is utilized in the resultant representation. To that end, we employ a re-embedding component in which a contextual and a non-contextual representation are explicitly combined per token. Specifically, for a sequence of word-embeddings $w_1, \ldots, w_k$ with $w_t \in \mathbb{R}^{d_w}$, the re-embedding $w'_t$ of the $t$-th token is the result of a Highway layer (Srivastava et al., 2015) and is defined as:

$$w'_t = g_t \odot w_t + (1 - g_t) \odot z_t$$
$$g_t = \sigma(W_g x_t + U_g u_t)$$
$$z_t = \tanh(W_z x_t + U_z u_t)$$

where $x_t$ is a function strictly of the word-type of the $t$-th token, $u_t$ is a function of the enclosing sequence, $W_g, W_z, U_g, U_z$ are parameter matrices, and $\odot$ is the element-wise product operator. We set $x_t = [w_t; c_t]$, a concatenation of $w_t$ with $c_t$, where the latter is a character-based representation of the token's word-type produced via a CNN over character embeddings (Kim, 2014). We note that word-embeddings $w_t$ are pre-trained (Pennington et al., 2014) and are kept fixed during training, as is commonly done in order to reduce model capacity and mitigate overfitting. We next describe different formulations for the contextual term $u_t$.
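The gating described above can be illustrated with a small sketch. It assumes the usual Highway form, in which a sigmoid gate interpolates element-wise between the fixed word-embedding and a tanh candidate computed from the word-type and contextual features. The function names (`matvec`, `sigmoid`, `reembed`) and the list-of-rows matrix layout are illustrative choices, not taken from the authors' code base.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def reembed(w, x, u, Wg, Ug, Wz, Uz):
    """Gate between the fixed embedding w and a candidate z built from x and u.

    w: fixed word-embedding (length d_w)
    x: word-type features, u: contextual features
    Wg, Wz: d_w-by-len(x) matrices; Ug, Uz: d_w-by-len(u) matrices
    Returns (w_new, g) so that the gate values can be inspected.
    """
    g = [sigmoid(a + b) for a, b in zip(matvec(Wg, x), matvec(Ug, u))]
    z = [math.tanh(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, u))]
    # A gate entry near 1 keeps the fixed embedding; near 0 it favours z.
    w_new = [g_i * w_i + (1.0 - g_i) * z_i for g_i, w_i, z_i in zip(g, w, z)]
    return w_new, g
```

Returning the gate alongside the re-embedding makes it possible to inspect, per token, how much weight falls on the fixed embedding.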

RNN-based token re-embedding (TR)

Here we set $\{u_1, \ldots, u_k\} = \mathrm{BiLSTM}(x_1, \ldots, x_k)$ as the hidden states of the top layer in a stacked BiLSTM of multiple layers, each uni-directional LSTM in each layer having $d_h$ cells and $u_t \in \mathbb{R}^{2 d_h}$.

LM-augmented token re-embedding (TR + LM)

The simple module specified above allows better exploitation of the context that a token appears in, if such exploitation is needed and is not learned by the rest of the network, which operates over $w'_1, \ldots, w'_k$. Our findings in Section 4 indicate that context is crucial but that in our setting it may be utilized to a limited extent.

We hypothesize that the main determining factor in this behavior is the relatively small size of the data and its distribution, which does not require using long-range context in most examples. Therefore, we leverage a strong language model that was pre-trained on large corpora as a fixed encoder which supplies additional contextualized token representations. We denote these representations as $\{o_1, \ldots, o_k\}$ and set $u_t = [u'_t; o_t]$ for $\{u'_1, \ldots, u'_k\} = \mathrm{BiLSTM}(x_1, \ldots, x_k)$.

The LM we use is the publicly available model of Józefowicz et al. (2016), named BIG LSTM+CNN INPUTS in that work and trained on the One Billion Words Benchmark dataset (Chelba et al., 2013). It consists of an initial layer which produces character-based word representations, followed by two stacked LSTM layers and a softmax prediction layer. The hidden state outputs of each LSTM layer are projected down to a lower dimension via a bottleneck layer (Sak et al., 2014). We set $\{o_1, \ldots, o_k\}$ to either the projections of the first layer, referred to as TR + LM(L1), or those of the second one, referred to as TR + LM(L2).

With both re-embedding schemes, we use the resulting representations as a drop-in replacement for the word-embedding inputs fed to a standard model, described next.

3 Base model

We build upon Lee et al. (2016), who proposed the RaSoR model. For word-embedding inputs $q_1, \ldots, q_m$ and $p_1, \ldots, p_n$ of dimension $d_w$, RaSoR consists of the following components:

Passage-independent question representation

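The question is encoded via a BiLSTM, $\{v_1, \ldots, v_m\} = \mathrm{BiLSTM}(q_1, \ldots, q_m)$, and the resulting hidden states are summarized via attention (Bahdanau et al., 2015; Parikh et al., 2016): $q^{indep} = \sum_{j=1}^{m} \alpha_j v_j$. The attention coefficients $\alpha$ are normalized logits, $\{\alpha_1, \ldots, \alpha_m\} = \mathrm{softmax}(s_1, \ldots, s_m)$, where $s_j = w_q^\top \mathrm{FF}(v_j)$ for a parameter vector $w_q$ and $\mathrm{FF}(\cdot)$ a single layer feed-forward network.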
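As an illustration of this attention-based summarization, the following minimal sketch pools a list of hidden-state vectors into a single vector. The scoring function is passed in as a hypothetical callable `score`, standing in for the parameter vector applied to the feed-forward network output; `softmax` and `attention_pool` are names chosen for this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, score):
    """Summarize equal-length vectors into one vector via softmax attention.

    states: list of hidden-state vectors (e.g. BiLSTM outputs)
    score:  callable mapping a vector to a scalar logit (hypothetical stand-in
            for a learned scoring network)
    Returns (pooled_vector, attention_coefficients).
    """
    alphas = softmax([score(v) for v in states])
    dim = len(states[0])
    pooled = [sum(a * v[d] for a, v in zip(alphas, states)) for d in range(dim)]
    return pooled, alphas
```

Because the logits depend only on the question's own hidden states, the pooled vector is the same for every passage position.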

Passage-aligned question representations

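For each passage position $i$, the question is encoded via attention operated over its word-embeddings: $q_i^{align} = \sum_{j=1}^{m} \beta_{ij} q_j$. The coefficients $\beta_i$ are produced by normalizing the logits $\{s_{i1}, \ldots, s_{im}\}$, where $s_{ij} = \mathrm{FF}(q_j)^\top \mathrm{FF}(p_i)$.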

Augmented passage token representations

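Each passage word-embedding $p_i$ is concatenated with its corresponding $q_i^{align}$ and with the independent $q^{indep}$ to produce $p_i^* = [p_i; q_i^{align}; q^{indep}]$, and a BiLSTM is operated over the resulting vectors: $\{h_1, \ldots, h_n\} = \mathrm{BiLSTM}(p_1^*, \ldots, p_n^*)$.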

Span representations

A candidate answer span $a = (l, r)$ with $l \le r$ is represented as the concatenation of the corresponding augmented passage representations: $h_a^* = [h_l; h_r]$. In order to avoid quadratic runtime, only spans up to length 30 are considered.
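The length restriction can be made concrete with a short sketch that enumerates candidate spans and builds the endpoint-concatenation representation. Indices are 0-based and inclusive here, and the function names are illustrative rather than taken from the RaSoR implementation.

```python
def candidate_spans(n, max_len=30):
    """All 0-based inclusive (l, r) pairs with l <= r < n and at most max_len tokens."""
    return [(l, r) for l in range(n) for r in range(l, min(n, l + max_len))]

def span_representation(h, span):
    """Concatenate the augmented passage vectors at the two span endpoints."""
    l, r = span
    return h[l] + h[r]  # list concatenation, not element-wise addition
```

For a 100-token passage this yields 2,565 candidates instead of the 5,050 of the unrestricted case, and the count grows linearly rather than quadratically with passage length.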

Prediction layer

Finally, each span representation $h_a^*$ is transformed to a logit $s_a = w_c^\top \mathrm{FF}(h_a^*)$ for a parameter vector $w_c$, and these logits are normalized to produce a distribution over spans. Learning is performed by maximizing the log-likelihood of the correct answer span.
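A minimal sketch of this objective, assuming the span logits have already been computed and are stored in a dictionary keyed by span (a representation chosen for this example only):

```python
import math

def span_log_probs(logits):
    """Log-softmax over a dict mapping each candidate span to its logit."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in logits.values()))
    return {span: s - log_z for span, s in logits.items()}

def negative_log_likelihood(logits, gold_span):
    """Loss for one example: minimizing it maximizes the gold span's log-likelihood."""
    return -span_log_probs(logits)[gold_span]
```

Normalizing over all candidate spans jointly, rather than over start and end positions separately, matches the description of a single distribution over spans.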

4 Evaluation and Analysis

We evaluate our contextualization scheme on the SQuAD dataset (Rajpurkar et al., 2016), which consists of 100,000+ paragraph-question-answer examples, crowdsourced from Wikipedia articles.

Importance of context

We are interested in evaluating the effect of our RNN-based re-embedding scheme on the performance of the downstream base model. However, the addition of the re-embedding module incurs additional depth and capacity for the resultant model. We therefore compare this model, termed RaSoR + TR, to a setting in which re-embedding is non-contextual, referred to as RaSoR + TR(MLP). Here we set $u_t = \mathrm{MLP}(x_t)$, a multi-layered perceptron on $x_t$, allowing for the additional computation to be carried out on word-level representations without any context and matching the model size and hyper-parameter search budget of RaSoR + TR. In Table 1 we compare these two variants over the development set and observe superior performance by the contextual one, illustrating the benefit of contextualization and specifically per-sequence contextualization, which is done separately for the question and for the passage.

Context complements rare words

Our formulation lends itself to an inspection of the different dynamic weightings computed by the model for interpolating between contextual and non-contextual terms. In Figure 1 we plot the average gate value for each word-type, where the average is taken across entries of the gate vector and across all occurrences of the word in both passages and questions. This inspection reveals the following: on average, the less frequent a word-type is, the smaller its gate activations are, i.e., the re-embedded representation of a rare word places less weight on its fixed word-embedding and more on its contextual representation, compared to a common word. This highlights a problem with maintaining fixed word representations: albeit pre-trained on extremely large corpora, the embeddings of rare words need to be complemented with information emanating from their context. Our specific parameterization allows observing this directly, but it may very well be an implicit burden placed on any contextualizing encoder such as a vanilla BiLSTM.
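The per-word-type averaging behind such a plot can be sketched as follows. The inputs are assumed to be a flat list of tokens and a parallel list of gate vectors collected over passages and questions; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def mean_gate_by_word_type(tokens, gates):
    """Average gate value per word-type.

    tokens: list of word-type strings, one per token occurrence
    gates:  list of gate vectors (lists of floats), aligned with tokens
    The mean is taken over gate entries, then over occurrences of each type.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tok, g in zip(tokens, gates):
        sums[tok] += sum(g) / len(g)
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}
```

Plotting these averages against word-type frequency is one way to inspect how gate values vary with word rarity.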

Model                    EM     F1
RaSoR (base model)       70.6   78.7
RaSoR + TR(MLP)          72.5   79.9
RaSoR + TR               75.0   82.5
RaSoR + TR + LM(emb)     75.8   83.0
RaSoR + TR + LM(L1)      77.0   84.0
RaSoR + TR + LM(L2)      76.1   83.3

Table 1: Results on SQuAD's development set. The EM metric measures an exact match between a predicted answer and a correct one, and the F1 metric measures the overlap between their bags of words.
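For concreteness, the two metrics can be sketched for a single prediction and a single reference answer. This is a simplified illustration: it splits on whitespace only, whereas the official SQuAD evaluation script also normalizes answers and takes the maximum over several reference answers.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the whitespace-tokenized answers are identical, else 0.0."""
    return float(prediction.split() == gold.split())

def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall (bag-of-words overlap)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction with one extra token, such as "the Tel Aviv University" against "Tel Aviv University", scores 0 on EM but 6/7 on F1, which is why the two columns can differ by several points.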
Figure 1: Average gate activations.
Model                                  EM     F1
BiDAF + Self Attention + ELMo [1]      78.6   85.8
RaSoR + TR + LM(L1) [2]                77.6   84.2
SAN [3]                                76.8   84.4
r-net [4]                              76.5   84.3
FusionNet [5]                          76.0   83.9
Interactive AoA Reader+ [6]            75.8   83.8
RaSoR + TR [7]                         75.8   83.3
DCN+ [8]                               75.1   83.1
Conductor-net [9]                      73.2   81.9
RaSoR (base model) [10]                70.8   78.7

Table 2: Single-model results on SQuAD's test set (from SQuAD's leaderboard as of Dec 13, 2017).
[1] Peters et al. (2018) [2,7] This work. [3] Liu et al. (2017b) [4] Wang et al. (2017) [5] Huang et al. (2017) [6] Cui et al. (2017) [8] Xiong et al. (2017a) [9] Liu et al. (2017a) [10] Lee et al. (2016)

Incorporating language model representations

Supplementing the calculation of token re-embeddings with the hidden states of a strong language model proves to be highly effective. In Table 1 we list development set results for using either the LM hidden states of the first stacked LSTM layer or those of the second one. We additionally evaluate the incorporation of that model’s word-type representations (referred to as RaSoR + TR + LM(emb)), which are based on character-level embeddings and are naturally unaffected by context around a word-token.

Overall, we observe a significant improvement with all three configurations, effectively showing the benefit of training a QA model in a semi-supervised fashion (Dai and Le, 2015) with a large language model. Besides a crosscutting boost in results, we note that the performance due to utilizing the LM hidden states of the first LSTM layer significantly surpasses the other two variants. This may be due to context being most strongly represented in those hidden states, as the representations of LM(emb) are non-contextual by definition and those of LM(L2) were optimized (during LM training) to be similar to parameter vectors that correspond to word-types and not to word-tokens.

In Table 2 we list the top-scoring single-model published results on SQuAD's test set, where we observe that RaSoR + TR + LM(L1) ranks second in EM, despite having only minimal question-passage interaction, which is a core component of other works. We additionally carry out the evaluation of Jia and Liang (2017), who demonstrated that current QA models are prone to being fooled by distracting sentences added to the paragraph. In Table 3 we list the single-model results reported thus far and observe that the utilization of LM-based representations carried out by RaSoR + TR + LM(L1) results in improved robustness to adversarial examples.

Model                       AddSent   AddOneSent
RaSoR + TR + LM(L1) [1]     47.0      57.0
Mnemonic Reader [2]         46.6      56.0
RaSoR + TR [3]              44.5      53.9
MPCM [4]                    40.3      50.0
RaSoR (base model) [5]      39.5      49.5
ReasoNet [6]                39.4      50.3
jNet [7]                    37.9      47.0

Table 3: Single-model F1 on adversarial SQuAD.
[1,3] This work. [2] Hu et al. (2017) [4] Wang et al. (2016) [5] Lee et al. (2016) [6] Shen et al. (2017) [7] Zhang et al. (2017)

5 Experimental setup

We use pre-trained GloVe embeddings (Pennington et al., 2014) of dimension and produce character-based word representations via convolutional filters over character embeddings as in Seo et al. (2016). For all BiLSTMs, hyper-parameter search included the following values, with model selection being done according to validation set results (underlined): number of stacked BiLSTM layers , number of cells , dropout rate over input , dropout rate over hidden state . To further regularize models, we employed word dropout (Iyyer et al., 2015; Dai and Le, 2015) at rate and coupled the LSTM input and forget gates as in Greff et al. (2016). All feed-forward networks and the MLP employed the ReLU non-linearity (Nair and Hinton, 2010) with dropout rate , where the single hidden layer of the FFs was of dimension and the best performing MLP consisted of 3 hidden layers of dimensions , and . For optimization, we used Adam (Kingma and Ba, 2015) with batch size .
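As a rough illustration of word dropout, the sketch below zeroes out whole token embeddings at random. This is one common variant and may differ from the exact procedure used in the experiments; `rate` is a free parameter of the sketch, and the random generator is passed in explicitly so that runs are reproducible.

```python
import random

def word_dropout(embeddings, rate, rng):
    """Zero out each token's entire embedding independently with probability `rate`.

    embeddings: list of token vectors (lists of floats)
    rate:       drop probability in [0, 1] (hypothetical value chosen by the caller)
    rng:        a random.Random instance
    Returns a new list; the input is left unmodified.
    """
    return [
        [0.0] * len(vec) if rng.random() < rate else list(vec)
        for vec in embeddings
    ]
```

Dropping entire tokens, rather than individual embedding dimensions, is intended to discourage over-reliance on any single word. Such dropout is normally applied only at training time.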

6 Related Work

Our use of a Highway layer with RNNs is related to Highway LSTM (Zhang et al., 2016) and Residual LSTM (Kim et al., 2017). The goal in those works is to effectively train many stacked LSTM layers, and so highway and residual connections are introduced into the definition of the LSTM function. Our formulation is external to that definition, with the specific goal of gating between LSTM hidden states and fixed word-embeddings.

Multiple works have shown the efficacy of semi-supervision for NLP tasks (Søgaard, 2013). Pre-training a LM in order to initialize the weights of an encoder has been reported to improve generalization and training stability for sequence classification (Dai and Le, 2015) as well as translation and summarization (Ramachandran et al., 2017).

Similar to our work, Peters et al. (2017) utilize the same pre-trained LM from Józefowicz et al. (2016) for sequence tagging tasks, keeping encoder weights fixed during training. Their formulation includes a backward LM and uses the hidden states from the top-most stacked LSTM layer of the LMs, whereas we also consider reading the hidden states of the bottom one, which substantially improves performance. In parallel to our work, Peters et al. (2018) have successfully leveraged pre-trained LMs for several tasks, including RC, by utilizing representations from all layers of the pre-trained LM.

In a transfer-learning setting, McCann et al. (2017) pre-train an attentional encoder-decoder model for machine translation and show improvements across a range of tasks when incorporating the hidden states of the encoder as additional fixed inputs for downstream task training.

7 Conclusion

In this work we examine the importance of context for the task of reading comprehension. We present a neural module that gates contextual and non-contextual representations and observe gains due to context utilization. Consequently, we inject contextual information into our model by integrating a pre-trained language model through our suggested module and find that it substantially improves results, reaching state-of-the-art performance on the SQuAD dataset.


We thank the anonymous reviewers for their constructive comments. This work was supported by the Israel Science Foundation, grant 942/16, and by the Yandex Initiative in Machine Learning.