Comparative Analysis of Neural QA models on SQuAD

by   Soumya Wadhwa, et al.
Carnegie Mellon University

The task of Question Answering has gained prominence in the past few decades for testing the ability of machines to understand natural language. Large datasets for Machine Reading have led to the development of neural models that cater to deeper language understanding compared to information retrieval tasks. Different components in these neural architectures are intended to tackle different challenges. As a first step towards achieving generalization across multiple domains, we attempt to understand and compare the peculiarities of existing end-to-end neural models on the Stanford Question Answering Dataset (SQuAD) by performing quantitative as well as qualitative analysis of the results attained by each of them. We observed that prediction errors reflect certain model-specific biases, which we further discuss in this paper.


Learning to Paraphrase for Question Answering

Question answering (QA) systems are sensitive to the many different ways...

MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models

Retrieval question answering (ReQA) is the task of retrieving a sentence...

Exploring The Spatial Reasoning Ability of Neural Models in Human IQ Tests

Although neural models have performed impressively well on various tasks...

On the Transferability of Minimal Prediction Preserving Inputs in Question Answering

Recent work (Feng et al., 2018) establishes the presence of short, unint...

Text Embeddings for Retrieval From a Large Knowledge Base

Text embedding representing natural language documents in a semantic vec...

What do we expect from Multiple-choice QA Systems?

The recent success of machine learning systems on various QA datasets co...

Exploring the Ideal Depth of Neural Network when Predicting Question Deletion on Community Question Answering

In recent years, Community Question Answering (CQA) has emerged as a pop...

1 Introduction

Machine Reading is a task in which a model reads a piece of text and attempts to formally represent it or performs a downstream task like Question Answering (QA). Neural approaches to the latter have gained a lot of prominence especially owing to the recent spur in developing and publicly releasing large datasets on Machine Reading and Comprehension (MRC). These datasets are created from different underlying sources such as web resources in MS MARCO Nguyen et al. (2016); trivia and web in QUASAR-S and QUASAR-T Dhingra et al. (2017), SearchQA Dunn et al. (2017), TriviaQA Joshi et al. (2017); news articles in CNN/Daily Mail Chen et al. , NewsQA Trischler et al. (2016) and stories in NarrativeQA Kočiskỳ et al. (2017). Another common source is large unstructured text documents from Wikipedia such as in SQuAD Rajpurkar et al. (2016), WikiReading Hewlett et al. (2016) and WikiHop Welbl et al. (2017). These different sources implicitly affect the nature and properties of questions and answers in these datasets. Based on the dataset, certain neural models capitalize on these biases while others are unable to. The ability to generalize across different sources and domains is a desirable characteristic for any machine reading system. Evaluating and analyzing systems on QA tasks can lead to insights for advancements in machine reading and natural language understanding, and Peñas et al. (2011) have also previously worked on this.

One of the first large MRC datasets (over 100k QA pairs) is the Stanford Question Answering Dataset (SQuAD) Rajpurkar et al. (2016). For its collection, different sets of crowd-workers formulated questions and answers using passages obtained from 500 Wikipedia articles. The answer to each question is a span in the given passage, and many effective neural QA models have been developed for this dataset. Our main focus in this work is to perform comparative subjective and empirical analysis of errors in answer predictions by four top performing models on the SQuAD leaderboard111

We focused on Bi-Directional Attention Flow (BiDAF) Seo et al. (2016), Gated Self-Matching Networks (R-Net) Wang et al. (2017), Document Reader (DrQA) Chen et al. (2017), Multi-Paragraph Reading Comprehension (DocQA) Clark and Gardner (2017)

, and the Logistic Regression baseline model

Rajpurkar et al. (2016)

We mainly choose these models since they have comparable high performance on the evaluation metrics and it is easy to replicate their results due to availability of open source implementations. While we limit ourselves to in-domain analysis of the performance of these models on SQuAD in this paper, similar principles can be used to extend this work to study biases of combinations of different models on different datasets and thereby understand the generalization capabilities of these neural architectures.

The organization of the paper is as follows. Section 2 gives a comprehensive overview of the models that are compared in further sections. Section 3 describes the different experiments we conducted, and discusses our observations. In Section 4, we summarize our main conclusions from this work and describe our vision for the future.

2 Relevant Neural Models

We present a brief overview of the models which we considered for our analysis in this section.

Bi-Directional Attention Flow (BiDAF):

This model, proposed by Seo et al. (2016)

, is a hierarchical multi-stage end-to-end neural network which takes inputs of different granularity (character, word and phrase) to obtain a query-aware context representation using memory-less context-to-query (C2Q) and query-to-context (Q2C) attention. This representation can then be used for different final tasks. Many versions of this model (with different types of input features) exist on the SQuAD leaderboard, but the basic architecture

222 (which we use for our experiments in this paper) contains character, word and phrase embedding layers, followed by an attention flow layer, a modeling layer and an output layer.

Gated Self-Matching Networks (R-Net):

This model, proposed by Wang et al. (2017), is a multi-layer end-to-end neural network whose novelty lies in the use of a gated attention mechanism so as to give different levels of importance to different passage parts. It also uses self-matching attention for the context to aggregate evidence from the entire passage to refine the query-aware context representation obtained. The architecture contains character and word embedding layers, followed by question-passage encoding and matching layers, a passage self-matching layer and an output layer. The implementation we used333 had some minor changes for increased efficiency.

Document Reader (DrQA):

This model, proposed by Chen et al. (2017), focuses on answering open-domain factoid questions using Wikipedia, but also performs well on SQuAD (skipping the document retrieval stage). Its implementation444

has paragraph and question encoding layers, and an output layer. The paragraph encoding is computed by representing each context as a sequence of feature vectors derived from tokens: word embedding, exact match with question word, POS/NER/TF and aligned question embedding, and passing these as inputs to a recurrent neural network. The question encoding is obtained by using word embeddings as inputs to a recurrent neural network.

Multi-Paragraph Reading Comprehension (DocQA):

This model, proposed by Clark and Gardner (2017), aims to answer questions based on entire documents (multiple paras) rather than specific paragraphs, but also gives good results for SQuAD (considering the given paragraph as the document). The implementation555 contains input, embedding (character and word-level), pre-processing (shared bidirectional GRU between question and passage), attention (similar to BiDAF), self-attention (residual) and output (bidirectional GRU and linear scoring) layers.

Logistic Regression (LR):

This model was proposed as a baseline in the SQuAD dataset paper Rajpurkar et al. (2016)

and uses features based on n-gram frequencies, lengths, part-of-speech tags, constituency and dependency parse trees of questions and passages as inputs to a logistic regression classifier

666 to predict whether each constituent span is an answer or not.

3 Experiments and Discussion

We trained the aforementioned end-to-end neural models and compare their performance on the SQuAD development set which contains 10,570 question-answer pairs based on Wikipedia articles.

3.1 Quantitative Analysis

To perform a systematic comparison of errors across different models, we investigate the predictions based on the following criteria.

3.1.1 Span-Level Performance

The span-level performance is measured typically by Exact Match (EM) and F1 metrics which are reported with respect to the ground truth answer spans. These results are summarized in Table 1. The DocQA model gives the best overall performance which aligns well with our expectation, owing to the usage of and improvements in the prior mechanisms introduced in BiDAF and R-Net.

Model BiDAF R-Net DrQA DocQA LR
EM (%) 67.67 70.12 66.00 71.60 40.14
F1 (%) 77.31 78.94 76.28 80.78 50.98
Correct Sentence (%) 91.05 92.37 92.40 93.77 83.30
Table 1: Span and Sentence Level Performance

3.1.2 Sentence-Level Performance

To investigate trends at different granularities, we also measure sentence retrieval performance. The context given for each question-answer pair is split into sentences using the NLTK sentence tokenizer777, and the sentence-level accuracy of each of the models is computed (Table 1

). Since the default sentence tokenizer for English in NLTK is pre-trained on Penn Treebank data which contains formal language (news articles), we expect it to perform reasonably well on Wikipedia articles too. We observe that all the models have high sentence-level accuracy, with DocQA outperforming the other models with respect to this metric as well. Interestingly, DrQA performs better on sentence retrieval accuracy than both BiDAF and R-Net, but has a worse span-level exact match score, which is probably because of the rich feature vector representation of the passage due to the model’s focus on open domain QA (and hence retrieval). But, none of these neural models have near-perfect ability to identify the correct sentence, and

90% accuracy indicates that even if we have a perfect answer selection method, this is the best EM score we can achieve. However, incorrect span identification contributes more to errors in prediction for all the models, as seen from the disparity between the sentence-level accuracies and the final span-level exact match score values.

3.1.3 Passage Length Distribution

We analyze the impact of passage length on errors, since this can be an important factor in determining the difficulty of understanding the passage. As seen in Figure 1, DocQA performs the best on shorter passages, while R-Net and BiDAF are observed to be better for longer passages. However, there are no systematic error patterns and overall error rates, surprisingly, are not much higher for longer passages. This means that predictions on long passages are almost as good as on short (presumably easier to understand) passages.

Figure 1: Percentage of total QA pairs for each range of passage lengths which have incorrect predictions by different models

3.1.4 Question Length Distribution

We also do a similar error analysis for questions of different lengths. Since there are very few questions which have length greater than 30, the estimate for range 30-34 is not very reliable. In Figure

2, we observe that the error rate first decreases and then increases for BiDAF, DrQA and DocQA. A plausible explanation for this is that shorter questions contain insufficient information in order to be able to select the correct answer span and can hence be confusing, but it also becomes difficult for end-to-end neural models to learn a good representation when the question becomes longer and syntactically more complicated. However, R-Net has an irregular trend with respect to question length, which is difficult to explain.

Figure 2: Percentage of total QA pairs for each range of question lengths which have incorrect predictions by different models

3.1.5 Answer Length Distribution

For answers of varying lengths, the error rates are shown in Figure 3. Again, estimates for answers with length 16 are not very reliable since data is sparse for high answer lengths. Here, we observe an increasing trend initially and then a slight decrease (bell shape). This conforms to the hypothesis that shorter answers are easier to predict than longer answers, but only up to a certain answer length (observed to be around 7 for most models). The slightly better performance for very long answers is likely due to such answers having a higher chance of being (almost) entire sentences with simpler questions being asked about them.

Figure 3: Percentage of total QA pairs for each answer length which have incorrect predictions by different models

3.1.6 Error Overlap

In Table 2, we analyze the number of erroneous predictions which overlap for different pairs of models, i.e., which belong to the intersection of the sets of incorrect answers generated by models in each (row, column) pair. Thus, the values in the table represent a symmetric matrix with diagonal elements indicating the number of errors which each model commits. This analysis can be useful while determining suitable models for creating meta ensembles since a low incorrect answer overlap indicates that the combined predictive power of the pair of models under consideration is high. We observe that most overlap values are in the range 20-25% indicating that an ensemble might give considerably better performance than individual models. DocQA paired with other models generates low values, as expected, but the least value is observed for the DocQA-DrQA pair probably because they both use very different feature representations and architectures, and hence generate diverse outputs. Note that DrQA is not the second best performing model (among the ones we analyzed) when considered independently, but might add more value to an ensemble because of the observed answer overlap trends.

Model BiDAF R-Net DrQA DocQA LR
BiDAF 32.33 21.97 22.56 21.22 26.58
R-Net 21.97 29.88 22.06 21.35 24.99
DrQA 22.56 22.06 34.00 20.95 27.49
DocQA 21.22 21.35 20.95 28.40 23.59
LR 26.58 24.99 27.49 23.59 59.86
Table 2: Incorrect Answer Overlap (%)

One way in which this analysis can help in exploring ensemble-based methods is that instead of trying all possible combinations of models, we can adopt a greedy approach based on the incorrect answer overlap metric to decide which model to add to the ensemble (and only if it leads to a statistically significant difference in this overlap). After determining an approximately optimal set of models which such an ensemble should be composed of, each of these models can be trained independently followed by multi-label classification (to select one of the generated answers) using techniques like logistic regression, a feed-forward neural network or a recurrent or convolutional neural network with input features based on the question, the passage and their token overlap. The entire network can also be trained end-to-end.

Also, all 5 models combined have an error overlap of 13.68%, i.e., if we had a mechanism to perfectly choose between these models, we would get an Exact Match score of 86.32%. This indicates that future work based on ensembling different neural models can give promising results and is worth exploring.

An example of a passage-question-answer that all of the models get wrong is:

Passage: The University of Warsaw was established in 1816, when the partitions of Poland separated Warsaw from the oldest and most influential Polish academic center, in Krakow. Warsaw University of Technology is the second academic school of technology in the country, and one of the largest in East-Central Europe, employing 2,000 professors. Other institutions for higher education include the Medical University of Warsaw, the largest medical school in Poland and one of the most prestigious, the National Defence University, highest military academic institution in Poland, the Fryderyk Chopin University of Music the oldest and largest music school in Poland, and one of the largest in Europe, the Warsaw School of Economics, the oldest and most renowned economic university in the country, and the Warsaw University of Life Sciences the largest agricultural university founded in 1818.
Question: What is one of the largest music schools in Europe?
Answer: Fryderyk Chopin University of Music

This passage-question-answer is difficult for automatic processing because there several entities of the same type (school / university) in the passage, and the question is a paraphrase of one segment of a very long, syntactically complicated sentence which contains the information required to be able to infer the correct answer. This presents an interesting challenge, and such qualitative observations can be used to formulate a general technique for effectively testing machine reading systems.

3.2 Qualitative Analysis

For qualitative error analysis, we sample 100 incorrect predictions (based on EM) from each model and try to find common error categories. Broadly, the errors observed were either because of incorrect answer span boundaries or inability to infer the meaning of the question / passage. Examples of each error type are shown in Table 3, and these are further described below.

Error Type Passage Question Predicted Answer
Incorrect answer boundary (longer) … survey of 4,745 North American Lutherans aged 15-65 found that, compared to the other minority groups under consideration, Lutherans were the least prejudiced toward Jews. Nevertheless, Professor Richard (Dick) Geary, … What did a survey of North American Lutherans find that Lutherans felt about Jews compared to other minority groups? 15-65 found that, compared to the other minority groups under consideration, Lutherans were the least prejudiced toward Jews
Incorrect answer boundary (shorter) … In the United States, in order for a prescription for a controlled substance to be valid, it must be issued for a legitimate medical purpose by a licensed practitioner acting in the course of legitimate doctor-patient relationship. The filling … What conditions must be met to prescribe a controlled substance? issued for a legitimate medical purpose
Soft Correct … for that time. The vBNS installed one of the first ever production OC-48c (2.5 Gbit/s) IP links in February 1999 and went on to upgrade the entire backbone … What did the network install in 1999? OC-48c (2.5 Gbit/s) IP links
Multi-Sentence User Datagram Protocol (UDP) is an example of a datagram protocol. In the virtual call system … model. The X.25 protocol suite uses this network type. X.25 uses what type network type? protocol suite
Paraphrase … rather than consumers. There is no known case of any U.S. citizens buying Canadian drugs for personal use with a prescription, who has ever been charged by authorities. Has there ever been anyone charged with importing drugs from Canada for personal medicinal use? has ever been charged by authorities
Same Entity Type / Unit Confusion … after the 1973 oil crisis, Honda, Toyota and Nissan, affected by the 1981 voluntary export restraints, opened US assembly plants and established their luxury divisions (Acura, Lexus and Infiniti, respectively) to distinguish themselves from their mass-market brands. Name a luxury division of Toyota. Acura, Lexus and Infiniti
Requires World Knowledge … disobedience in opposition to the decisions of non-governmental agencies such as trade unions, banks, and private universities can be justified if … What public entity of learning is often target of civil disobedience? governmental
Missing Inference Killer T cells are a sub-group of T cells that kill cells that are infected with viruses (and other pathogens), or are otherwise damaged or dysfunctional. As with B cells … What kind of T cells kill cells that are infected with pathogens? sub-group
Table 3: Examples of error types observed in the qualitative analysis - blue indicates ground truth

3.2.1 Boundary-Based Errors

Incorrect answer boundary (longer):

This error category includes those cases where the predicted span is longer than the ground truth answer, but contains the answer.

Incorrect answer boundary (shorter):

This error category includes those cases where the predicted span is shorter than the ground truth answer, and is a substring of the answer.

Soft Correct:

This error category includes those cases where the prediction is actually correct, but due to inclusion / exclusion of certain question terms (such as units) along with the answer, it is deemed incorrect.

3.2.2 Inference-Based Errors


This error category includes those cases where inference is required to be performed across 2 or more sentences in the given passage to be able to arrive at the answer, which leads to an incorrect prediction based on only 1 passage sentence.


This error category includes those cases where the question paraphrases certain parts of the sentence that it is asking about which makes lexical pattern matching difficult and leads to errors in prediction.

Same Entity Type Confusion / Unit Confusion:

This error category includes those cases where the question is about an entity type which is present multiple times in the passage and the model returns a different entity than the ground truth entity but of the same type.

Requires World Knowledge:

This error category includes questions which can not be answered using the given passage alone and require external knowledge to solve, leading to incorrect predictions.

Missing Inference:

This category includes inference-related errors which don’t belong to any of the other categories mentioned above.

3.2.3 Observations

In this section, we record the main observations from our qualitative error analysis and analyze potential reasons for the error trends observed. Figure 4 shows the different types of errors in predictions by various models.

Figure 4: Distribution of errors by various models across different categories using manual inspection

We observe that BiDAF makes many boundary-based errors which indicates that a better output layer (since this is responsible for span identification – although errors might have percolated from previous layers, most of these are cases where the model almost got the correct answer but not exactly) or some post-processing of the answer might help improve performance. Paraphrases also contribute to almost 15% of errors observed which indicates that the question and the relevant parts of the context are not effectively matched in these cases.

We observe that R-Net makes fewer boundary errors, perhaps because self-attention enables it to accumulate evidence and return better answer spans, although this leads to more errors of the ‘shorter’ answer type than ‘longer’. Also, missing inference contributes to almost 20% of the observed errors (not including multiple sentences or paraphrases).

Paraphrasing is the most frequent error category observed for DrQA, which makes sense if we consider the features used to represent each passage, such as exact match with a question word, which depend on lexical overlap between the question and passage.

We observe that DocQA makes many boundary errors too, again making more mistakes by predicting shorter answers than expected in most of the observed cases. A better root cause analysis can be performed by visualizing outputs from different layers and evaluating these, and we leave this in-depth investigation to future work. Also, the high number of Soft Correct outputs across all models points to some deficiencies in the SQuAD annotations, which might limit the reliability of the performance evaluation metrics.

Although these state-of-the-art deep learning models for machine reading are supposed to have inference capabilities, our error analysis above points to their limitations. These insights can be useful for developing benchmarks and datasets which enable realistic evaluation of systems which aim to ‘solve’ the RC task. In

Wadhwa et al. (2018), we take a first step in this direction by proposing a method focused on questions involving referential inference, a setting to which these models fail to generalize well.

4 Conclusion and Future Work

In this work, we analyze - both quantitatively and qualitatively - results generated by 4 end-to-end neural models on the Stanford Question Answering Dataset. We observe interesting trends in the analysis, with some error patterns which are consistent across different models and some others which are specific to each model due to their different input features and architectures. This is important to be able to interpret and gain an intuition for the effective functions that different components in a neural model architecture perform versus their intended functions, and also to understand model-specific biases. Eventually, this can enable us to come up with new models including specific components which tackle these errors. Alternatively, the overlap analysis demonstrates that learning ensembles of different neural models to combine their individual strengths and quirks might be an interesting direction to explore to achieve better performance.

Even though the scope of this paper is restricted to SQuAD, similar analysis can be done for any datasets / models / features, to gain a better understanding and enable a better assessment of state-of-the-art in neural machine reading. To this end, we also performed some preliminary experiments on TriviaQA so as to analyze the difference between the properties of the two datasets, but were unable to replicate the published results owing to pre-processing / hyperparameters. We will continue to work on this since the ability of a model to generalize and to be able to learn from a particular domain and transfer some knowledge to a different domain is a very exciting research area.

We also believe that such analysis can help curate datasets which are better indicators of the actual natural language ‘reading’ and ‘comprehending’ capabilities of models rather than falling prey to shallow pattern matching. One way to achieve this is by building new challenges that are specifically designed to put pressure on the identified weaknesses of neural models. Thus, we can move towards the development of datasets and models which truly push the envelope of the challenging machine reading task.


We would like to thank Chaitanya Malaviya and Abhishek Chinni for their valuable feedback, and the Language Technologies Institute at CMU for the GPU resources used in this work. We are also very grateful to the anonymous reviewers for their insightful comments and suggestions, which helped us polish the presentation of our work.