
Robust Question Answering Through Sub-part Alignment

Current textual question answering models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns in the data, so they fail to generalize to out-of-distribution and adversarial settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and solve a subgraph alignment problem to find a part of the context that matches the question. Our model uses BERT to compute alignment scores, and by using a structured SVM, we can train end-to-end despite complex inference. Our explicit use of alignments allows us to explore a set of constraints with which we can prohibit certain types of bad behavior that arise in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input led the model to choose the answer it did, without relying on "local" post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust cross-domain than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.




1 Introduction

Current text-based question answering models learned end-to-end often rely on spurious patterns between the question and context rather than learning the desired behavior. They might be able to ignore the question entirely Kaushik and Lipton (2018), focus primarily on the answer type Mudrakarta et al. (2018), or otherwise ignore the “intended” mode of reasoning for the task Chen and Durrett (2019); Niven and Kao (2019). Thus, these models are not robust to adversarial attacks Jia and Liang (2017); Iyyer et al. (2018); Wallace et al. (2019): their reasoning processes are brittle, so they can be fooled by surface-level distractors that look laughable to humans. Methods like adversarial training Miyato et al. (2016); Wang and Bansal (2018); Lee et al. (2019); Yang et al. (2019), data augmentation Welbl et al. (2020), and posterior regularization Pereyra et al. (2016); Zhou et al. (2019) have been proposed to improve robustness. However, these methods are often optimized around a certain type of error, and it remains unclear how to dynamically adapt such models to new adversarial settings that may come along.

Figure 1: A typical adversarial example on SQuAD, where the model picks the adversarial answer. By breaking the question and context into smaller units, we can expose the error (the wrong entity match) and use explicit constraints to fix it.

In this paper, we explore a model for text-based question answering through sub-part alignment. The core idea behind our method is that if every sub-part of the question is well supported by the answer context, then the answer produced should be trustable; if not, we have a sense that the model is making an incorrect prediction. For instance, Figure 1 shows an adversarial example from SQuAD Jia and Liang (2017) where a standard BERT QA model predicts the wrong answer August 18, 1991, and we do not know why. However, if we decompose the question into smaller units, we can see that the error arises because Super Bowl 50 aligns to Champ Bowl and misleads the model. By exposing this error directly, we make it easier to subsequently patch, as we discuss later.

Specifically, we incorporate Semantic Role Labeling (SRL) to decompose the sentences of the question and context into predicates and corresponding arguments. Then we view the question answering procedure as a constrained graph alignment problem where the nodes are represented by the predicates and arguments, and the edges are formed by relations between them (e.g. predicate-argument relations and coreference relations). Our question should align to a local subgraph in the context, though our process is more flexible than graph alignments used in prior work Sachan and Xing (2016); Khashabi et al. (2018). Once we complete the alignment, the node aligned to the wh-span should contain the answer, so we use a standard QA model to extract the answer from this span. Note that while we use SRL in this work, our model could work with any graph-structured semantic representation, including AMR Sachan and Xing (2016).

Each pair of aligned nodes is scored using BERT Devlin et al. (2019); these alignment scores are then plugged into a beam search procedure to find the optimal graph alignment subject to constraints. This structured alignment model can be trained as a structured support vector machine (SSVM) to minimize alignment error with respect to heuristically-derived oracle alignments subject to graph constraints. The alignment scores are computed in a black-box way, so the model does not necessarily produce token-level explanations Jain and Wallace (2019); however, the score of an answer is directly a sum of the scores of the aligned pieces, making this structured prediction phase of the model “faithful by construction.” Critically, this allows us to understand which parts of the alignment are responsible for a prediction and, if needed, constrain the behavior of the alignment to correct for certain types of errors.

We view this interpretability and extensibility with constraints as one of the principal advantages of our model. As such, we train our model on standard SQuAD Rajpurkar et al. (2016) and focus on performance on out-of-domain data and on two adversarial datasets, SQuAD Adversarial Jia and Liang (2017) and Universal Triggers on SQuAD Wallace et al. (2019), to probe the model’s behavior under different adversarial settings when it has only been exposed to “clean” training examples. Our framework allows us to incorporate natural constraints on alignment scores to improve zero-shot performance in adversarial settings. Finally, our model’s alignments serve as “explanations” for its predictions, allowing us to ask why certain predictions were made over others and to examine scores for hypothetical other answers the model could give.

Figure 2: The constructed graph based on an example on SQuAD dev. Here Super Bowl 50 and the game are connected by a coreference edge. The edge from was to determine is formed through a predicate nested inside an argument. The oracle alignment (Section 3.4) is shown with dotted lines.

2 Question Answering as Graph Alignment

Our approach critically relies on the ability to decompose questions and answers into a graph over text spans. Our model can in principle work for a range of syntactic and semantic structures, including dependency parsing, SRL Palmer et al. (2005), and AMR Banarescu et al. (2013). We use SRL in this work and augment it with coreference links, due to the high performance and flexibility of current SRL parsers Shi and Lin (2019); Peters et al. (2018).

Graph Construction

An example of the graph we construct is shown in Figure 2. Both the question and passage are represented as graphs where the nodes consist of predicates and arguments. Edges are undirected and connect each predicate and its corresponding arguments. Since SRL only captures the predicate-argument relations within one sentence, we add context information to the graph through coreference edges: if two arguments are in the same coreference cluster, we add an edge between them. Finally, in certain cases involving verbal or clausal arguments, there might exist nested structures where an argument to one predicate contains a separate predicate-argument structure. In this case, we remove the larger argument and add an edge directly between the two predicates. This is shown by the edge from was to determine (labeled as nested structure) in Figure 2. Breaking down such large arguments chiefly helps in avoiding ambiguity during alignment.
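As a concrete sketch, the construction above can be written as a small routine. This is an illustrative reimplementation, not the authors' code: SRL frames are assumed to arrive as (predicate, arguments) span pairs, coreference clusters as lists of mention spans, and nesting is detected by simple substring containment.

```python
def build_graph(srl_frames, coref_clusters):
    """Return (nodes, edges): nodes are text spans, edges undirected pairs.

    srl_frames: list of (predicate_span, [argument_spans]) -- assumed format.
    coref_clusters: list of mention-span lists from a coreference system.
    """
    nodes, edges = set(), set()
    predicates = {pred for pred, _ in srl_frames}

    for pred, args in srl_frames:
        nodes.add(pred)
        for arg in args:
            # Nested structure: if an argument contains another predicate,
            # drop the large argument and link the two predicates directly.
            inner = next((p for p in predicates if p != pred and p in arg), None)
            if inner is not None:
                edges.add(frozenset((pred, inner)))
            else:
                nodes.add(arg)
                edges.add(frozenset((pred, arg)))

    # Coreference edges between arguments in the same cluster.
    for cluster in coref_clusters:
        mentions = [m for m in cluster if m in nodes]
        for a, b in zip(mentions, mentions[1:]):
            edges.add(frozenset((a, b)))
    return nodes, edges
```

For the running example, the large argument containing the predicate determine is dropped and replaced by a direct was–determine edge, mirroring the "nested structure" edge in Figure 2.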

The alignment structure between the question and context has proven useful for question answering in previous work Khashabi et al. (2018); Sachan and Xing (2016); Sachan et al. (2015). Our framework differs from theirs in that it incorporates a much stronger alignment model (BERT), allowing us to significantly relax the alignment constraints while still achieving high performance.

Alignment Constraints

Once we have the constructed graph, we can align each node in the question to its counterpart in the context graph. In this work, we control the alignment behavior by placing explicit constraints on this process. For example, we place a locality constraint on the alignment, meaning that two aligned nodes in the context graph cannot be too far away from each other. Specifically, we constrain adjacent pairs of question nodes to align no more than $d$ nodes apart in the context: $d = 1$ means we are aligning the question to a sub-graph in the context, while $d = \infty$ means we can align to a node anywhere in the context graph. In the following sections, we discuss further constraints.
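The locality constraint can be checked with a breadth-first search over the context graph. The sketch below is our own framing, assuming the graph is given as an adjacency dict; it is not the paper's implementation.

```python
from collections import deque

def graph_distance(adj, src, dst):
    """BFS distance between two nodes in an undirected context graph."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nbr in adj.get(node, ()):
            if nbr == dst:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return float("inf")  # unreachable

def satisfies_locality(adj, alignment_pairs, d):
    """alignment_pairs: context nodes aligned to adjacent question nodes."""
    return all(graph_distance(adj, a, b) <= d for a, b in alignment_pairs)
```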

3 Graph Alignment Model

3.1 Model

We now describe the graph alignment process. Let $T$ represent the text of the passage and question concatenated together. Assume a decomposed question graph with vertices $q_1, \ldots, q_n$ represented by vectors $\mathbf{q}_1, \ldots, \mathbf{q}_n$, and a decomposed context with vertices $c_1, \ldots, c_m$ represented by vectors $\mathbf{c}_1, \ldots, \mathbf{c}_m$. Let $\mathbf{a} = (a_1, \ldots, a_n)$ be an alignment of question nodes to passage nodes, where $a_i$ indicates the alignment of the $i$th question node. Each question node is aligned to exactly one passage node, and multiple question nodes can align to the same passage node.

We frame question answering as a maximization over possible alignments:

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} \sum_{i=1}^{n} s(q_i, c_{a_i}) \quad \text{s.t. constraints on } \mathbf{a} \text{ are satisfied}$$

That is, we find the alignment that maximizes a scoring function under some constraints. In this paper, we simply choose the objective to be the sum over the scores of all alignment pairs $(q_i, c_{a_i})$, where $s(q_i, c_{a_i})$ denotes the alignment score between a question node $q_i$ and a context node $c_{a_i}$. This function relies on BERT Devlin et al. (2019) to compute embeddings of the question and context nodes and will be described more precisely in what follows. We will train this model as a structured support vector machine (SSVM), described in Section 3.2.

Figure 3: Alignment scoring. Here the alignment score is computed by the dot product between span representations of question and context nodes. The final alignment score (not shown) is the sum of these edge scores.


Our alignment scoring function is shown in Figure 3. Given a document and a question as raw text, we first concatenate the question with the document and then encode them using the pre-trained BERT encoder Devlin et al. (2019). We then extract the representation for each node (predicates and arguments) in the question and context using a span extractor, which in our case is simply mean pooling over the token representations. For example, the representation of a node $c_j$ in the document is given by $\mathbf{c}_j = \frac{1}{e - s + 1} \sum_{k=s}^{e} \mathbf{h}_k$, where $s$ and $e$ denote the span start and end positions of $c_j$ in the text $T$ and $\mathbf{h}_k$ is the BERT encoding of the $k$th token. The node representations in the question are computed in the same way.

In this work, we choose $s(q_i, c_j) = \mathbf{q}_i^{\top} \mathbf{c}_j$, the dot product between the corresponding node representations $\mathbf{q}_i$ and $\mathbf{c}_j$ introduced above.
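In toy form, the span extractor and scorer amount to mean pooling followed by a dot product. The hand-made token vectors below stand in for BERT outputs; this is a sketch of the arithmetic, not the model.

```python
def span_rep(token_vecs, start, end):
    """Mean-pool token vectors over the inclusive span [start, end]."""
    span = token_vecs[start:end + 1]
    n = len(span)
    return [sum(v[k] for v in span) / n for k in range(len(span[0]))]

def align_score(q_rep, c_rep):
    """Dot product between a question-node and context-node representation."""
    return sum(a * b for a, b in zip(q_rep, c_rep))
```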

Answer extraction

Our model so far produces an alignment between question nodes and passage nodes. We assume that one question node contains a wh-word or otherwise targets the answer. Theoretically, the wh-aligned passage node should correspond to the answer, but in practice it may not match exactly. For example, in Figure 2, the wh-alignment is on February 7, 2016, but we only need the actual date February 7, 2016 as an answer. We resolve this by using a standard text-based QA model, namely standard BERT QA, to extract the actual answer from the aligned span. To train this BERT model, we treat all arguments in the context that contain the answer as the “context” for BERT QA.

3.2 Training

We train our model as an instance of a structured support vector machine (SSVM). Ignoring the regularization term, this objective can be viewed as a sum over the training data of structured hinge losses with the following formulation:

$$\mathcal{L} = \max\left(0, \max_{\mathbf{a}} \left[ S(\mathbf{a}) + \mathrm{Ham}(\mathbf{a}, \mathbf{a}^*) \right] - S(\mathbf{a}^*) \right)$$

where $\mathbf{a}$ denotes a predicted alignment, $\mathbf{a}^*$ is the oracle alignment, $S$ is the total alignment score, and $\mathrm{Ham}$ is the Hamming loss between the two alignments. To get the predicted alignment during training, we run loss-augmented inference, as discussed in the next section. When computing the alignment for node $i$, if $a_i \neq a_i^*$, we add 1 to the alignment score to account for the loss term in the above equation. Intuitively, this objective requires the score of the gold prediction to be larger than that of any other hypothesis by a margin of $\mathrm{Ham}(\mathbf{a}, \mathbf{a}^*)$.
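The structured hinge loss can be sketched as follows. Here the candidate set stands in for the output of loss-augmented beam search, and the scoring function is an assumed black box; both are simplifications for illustration.

```python
def hamming(a, gold):
    """Number of question nodes whose alignment differs from the oracle."""
    return sum(ai != gi for ai, gi in zip(a, gold))

def structured_hinge(score, candidates, gold):
    """SSVM hinge loss for one example.

    score: maps an alignment (tuple of context indices) to its total score.
    candidates: alignments considered by (loss-augmented) inference.
    gold: the oracle alignment.
    """
    # Loss-augmented inference: each candidate's score is boosted by its
    # Hamming distance from the oracle alignment.
    augmented = max(score(a) + hamming(a, gold) for a in candidates)
    return max(0.0, augmented - score(gold))
```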

When training our system, we first do several iterations of local training, where we separately align each node to the oracle in the context without any constraints. This local training helps the global training converge more quickly.

                       SQuAD normal       addSent            addOneSent
                       ans in wh   F1     ans in wh   F1     ans in wh   F1
local t + local inf    81.3        81.4   34.3        34.0   45.9        46.3
local t + global inf   81.5        81.3   35.1        34.7   46.6        46.9
Subpart Alignment      81.4        81.1   42.3        42.2   59.8        57.8
BERT QA                —           87.8   —           39.2   —           52.6

Table 1: The performance of our proposed model on SQuAD and two adversarial settings from Jia and Liang (2017). “ans in wh” denotes the percentage of answers found in the span aligned to the wh-span, and F1 denotes the standard QA performance measure. For addSent and addOneSent, we only consider the adversarial examples in these datasets.

3.3 Inference

Since our alignment constraints do not strongly restrict the space of possible alignments (e.g., by enforcing a one-to-one alignment with a connected subgraph), searching over all valid alignments is intractable. We therefore use beam search to find the approximate highest-scoring alignment as follows: (1) We initialize the beam with the node pairs associated with the top $k$ highest alignment scores, where $k$ is the beam size. (2) For each hypothesis in the beam, we compute a set of reachable nodes based on the currently aligned pairs under the locality constraint. (3) We extend the current hypothesis by adding each of these possible alignments and accumulating its score. We continue beam search until all the nodes in the question are aligned, then return the highest-scoring hypothesis.
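A simplified version of this beam search is sketched below. For brevity it aligns question nodes in a fixed order and uses a permissive stub in place of the locality-based reachability test, so it illustrates the bookkeeping rather than faithfully reimplementing the paper's procedure.

```python
import heapq

def beam_search(score, beam_size, reachable=lambda partial, j: True):
    """Approximate best alignment of question nodes to context nodes.

    score: matrix where score[i][j] is the alignment score of question
    node i to context node j. reachable(partial, j) would encode the
    locality constraint; the default stub allows every context node.
    """
    n_q, n_c = len(score), len(score[0])
    # Initialize with the top-scoring alignments for the first question node.
    beam = [((j,), score[0][j]) for j in range(n_c)]
    beam = heapq.nlargest(beam_size, beam, key=lambda h: h[1])
    for i in range(1, n_q):
        # Extend each hypothesis with every reachable context node,
        # accumulating the alignment score, then re-prune to the beam size.
        expanded = [
            (partial + (j,), s + score[i][j])
            for partial, s in beam
            for j in range(n_c)
            if reachable(partial, j)
        ]
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[1])
    return max(beam, key=lambda h: h[1])
```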

An example of one step of beam hypothesis expansion is shown in Figure 4. In this state, the two played nodes are already aligned. In any valid alignment, the neighbors of the played question node must be aligned within two nodes of the played passage node to respect the locality constraint. We therefore only consider alignments for the game, on Feb 7, 2016 and Super Bowl 50 as new reachable nodes. Then the alignment scores between all reachable nodes and the remaining nodes in the question are computed and used to extend the beam hypotheses. The highest scoring hypothesis in the next beam ends up aligning the two Super Bowl 50 nodes.

Note that this inference procedure allows us to easily incorporate other constraints as well. For instance, we could require a “hard” match on entity nodes, meaning that two nodes containing entities can only be matched if they share exactly the same entities. In this sense, as shown in the figure, Super Bowl 50 can never be aligned to on Feb 6, 2016. We discuss such constraints more in Section 5.

Figure 4: An example of how we align a node with constraints. The blue node played is already aligned. The orange nodes denote all valid nodes that can be aligned to under the locality constraint at the current step. Here we only show the alignment candidates for Super Bowl 50; all other unaligned question nodes have the same alignment candidates.

3.4 Oracle Construction

The oracle construction essentially runs inference with a heuristically computed score matrix $S$, where $S_{ij}$ is the Jaccard similarity between a question node $q_i$ and a context node $c_j$. Instead of initializing the beam with the highest-scoring alignment pairs, we first align the wh-argument in the question with the nodes containing the answer in the context and then initialize the beam with legal alignment pairs.

If the Jaccard similarity between a question node and all context nodes is zero, we treat it as an unaligned node. During training, our approach can gracefully handle unaligned nodes by treating them as latent variables in the structured SVM; the gold “target” is the highest-scoring set of alignments consistent with the gold supervision. Procedurally, we run the inference algorithm first on the incomplete oracle (initializing the beam with the already-aligned nodes) to let the current model decide the best alignments for the unaligned nodes.
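The heuristic score matrix for the oracle can be computed with token-level Jaccard similarity. This is a minimal sketch; whitespace tokenization is our simplification.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two text spans."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def oracle_scores(q_nodes, c_nodes):
    """Score matrix S with S[i][j] = Jaccard(question node i, context node j)."""
    return [[jaccard(q, c) for c in c_nodes] for q in q_nodes]
```

Question nodes whose row in this matrix is all zeros are exactly the unaligned nodes left latent during training.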

4 Adversarial Robustness

Our focus in this work is primarily robustness, interpretability, and controllability of our model. We first focus on adapting to challenging adversarial settings in order to “stress test” our approach.

For all experiments, we train our model only on the unmodified SQuAD-1.1 dataset Rajpurkar et al. (2016) and examine how well it can generalize to adversarial and out-of-domain settings with minimal modification, using no fine-tuning on new data and no data augmentation that would capture the adversarial transformations.

4.1 Baselines

We compare primarily against a standard BERT QA system Devlin et al. (2019). We also investigate a local version of our model, where we only try to align the wh-node, without any global training (local t + local inf). Note that this can work fairly well because the BERT embeddings see the whole question and passage. We can also use our locally-trained alignment model with our global inference scheme (local t + global inf).

4.2 Adversarial Datasets

Added sentences

Jia and Liang (2017) propose appending an adversarial distracting sentence to the normal SQuAD development set to test the robustness of QA models. We use the two main test sets they introduced: addSent and addOneSent. Both augment the normal test set with adversarial sentences written by Turkers that are designed to look similar to question sentences. In this work, we mainly focus on the adversarial examples.

Universal Triggers

Wallace et al. (2019) use a gradient-based method to find short trigger sequences. Inserting a trigger into the original text causes the model to make a targeted prediction, independent of the rest of the passage content or the exact nature of the question. For QA, they generate different triggers for different question types, including “who”, “when”, “where” and “why”.

4.3 Implementation Details

We set the beam size to 20 for the constrained alignment. We use BERT-base-uncased for all of our experiments, and fine-tune the model using Adam Kingma and Ba (2014) with the learning rate set to 2e-5. We use the SpanBERT coreference system Joshi et al. (2020) and the BERT SRL system Shi and Lin (2019). When doing inference, we set the locality constraint to $d = 2$. We discard questions that do not have a valid SRL parse or do not contain a wh-word.

                                          addSent                 addOneSent
                                Normal    overall  adv    Δ       overall  adv    Δ
R.M-Reader Hu et al. (2018)     86.6      58.5     —      31.1    67.0     —      19.6
KAR Wang and Jiang (2018)       83.5      60.1     —      23.4    72.3     —      11.2
BERT + Adv Yang et al. (2019)   92.4      63.5     —      28.9    72.5     —      19.9
Our BERT                        87.8      61.8     39.2   27.0    70.4     52.6   18.4
Subpart Alignment               81.1      60.1     42.2   21.0    68.1     57.8   11.3

Table 2: Performance of our systems compared to the literature on both addSent and addOneSent. Here, overall denotes performance on the full adversarial set, adv denotes performance on the adversarial samples alone, and Δ denotes the performance gap between normal SQuAD and the overall performance on the adversarial set.

4.4 Results on Adversarial SQuAD

The results on the normal SQuAD development set and Adversarial SQuAD are shown in Table 1. We have the following observations:

Our model is not as good as BERT QA on normal SQuAD but outperforms it in adversarial settings.

Compared with the standard BERT QA model, our model is fitting a different data distribution (learning a constrained structure), which makes the task harder. This training scheme does cause some performance drop on normal SQuAD, but it improves F1 on addSent and addOneSent by 3.0 and 5.2 respectively. The smaller drop in performance under adversarial settings indicates that learning the alignment helps improve the robustness of our model.

Global training and inference substantially improves performance in adversarial settings, despite having no effect in-domain.

Normal SQuAD is a relatively easy dataset, and the answer to most questions can be found by simple lexical matching between the question and context. From the result of “local training + local inference”, we see that more than 80% of answers can be located by matching the wh-argument BERT embedding with the passage. However, as there are very strong distractors in SQuAD-Adversarial, wh-argument matching is unreliable. In such situations, the constraints imposed by the other argument alignments in the question are quite useful for correcting the wrong wh-alignment through global inference. We see that global inference is consistently better than local inference on both addSent and addOneSent.

Global training in addition to global inference is also important for our model to attain high performance. We find that the locally trained model tends to make overly confident predictions about each separate alignment. Since our global inference objective maximizes the sum of all alignment scores, the alignment tends to be dominated by those overconfident scores. During global training, the model might correct for this by learning to “ignore” certain alignments as long as it can get the final answer using the overall structure.

Once the answer is located, extracting the exact answer span is relatively easy.

Comparing the “ans in wh” and F1 results, we see that the actual F1 score is quite similar to the percentage of answers found in the wh-alignment. This indicates that when the actual answer is contained in the wh-alignment, the answer extraction module does a nearly perfect job.

          Subpart Alignment      BERT
type      Normal    Trigger     Normal    Trigger
who       82.2      80.5        87.1      78.5
why       73.5      64.9        76.5      59.7
when      84.0      80.3        90.3      80.9
where     79.7      75.6        84.1      75.8

Table 3: The performance of our model on SQuAD-Adversarial-Triggers. Compared with BERT, our model sees smaller performance drops on all triggers.

4.5 Results on Universal Triggers

The results on different triggers are shown in Table 3. We see that every trigger causes a bigger performance drop for BERT QA than for our model. Our model is much more stable, especially on who and when question types, where performance drops only by around 2%. Several factors may contribute to this stability: (1) the triggers are ungrammatical and their arguments often contain seemingly random words, which are likely to receive low alignment scores; (2) because our model is structured and trained differently from BERT, adversarial attacks designed for span-based question answering models may not fool our model as effectively.

4.6 Comparison to Existing Systems

We compare our best model (not using constraints from Section 5) with other models in the literature in Table 2. The overall performance of our model on the two adversarial sets is lower than BERT's, while the performance on the adversarial samples alone is higher. This is because we make the task harder: we trade some in-distribution performance to improve the model's robustness, controllability, and explainability. We also see that our model achieves the smallest gap on addSent and a comparable gap on addOneSent, which demonstrates that the constrained alignment we propose is a strong and effective way to enhance robustness compared to previous methods like adversarial training Yang et al. (2019) and explicit knowledge integration Wang and Jiang (2018).

5 Generalization via Alignment Constraints

One advantage of our explicit alignments is that we can understand mechanically what the model is doing. This also allows us to add constraints to our model to prohibit certain behaviors, thus allowing us to flexibly adapt our model to this adversarial setting.

Our constraints take the form of either hard constraints on alignments or constraints on alignment scores. These constraints may cause all candidate answers to be rejected on a given example. We therefore evaluate our model's accuracy at various coverage points.

Constraint on Entities

Examining addSent and addOneSent, we find that the model is fooled when question nodes containing entities align to adversarial entity nodes in the context. An intuitive constraint we can place on the alignment is a hard entity match: for each argument in the question, if it contains entities, it can only align to nodes in the context sharing exactly the same entities. We call this the “hard entity constraint”.
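The hard entity constraint amounts to a simple predicate on each candidate pair. In this sketch, the entity sets attached to each node are assumed to come from an off-the-shelf NER tagger, a detail not pinned down above.

```python
def entity_match_ok(q_entities, c_entities):
    """Hard entity constraint: a question node with entities may only align
    to a context node sharing exactly the same entity set; entity-free
    question nodes are unconstrained."""
    if not q_entities:
        return True
    return set(q_entities) == set(c_entities)
```

During beam search, candidate pairs failing this check are simply pruned, which is what prevents Super Bowl 50 from ever aligning to the adversarial Champ Bowl node.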

Constraint on Alignment Scores

The hard entity constraint is quite inflexible and does not generalize across questions (e.g., to questions that do not contain an entity). However, the alignment scores we obtain at inference time are a good indicator of how well a specific node pair is aligned. For a correct alignment, every pair should receive a reasonable alignment score; if an alignment goes wrong, there should exist some bad alignment pairs with relatively low scores compared to the good ones. By detecting such bad alignment pairs, we can reject those examples, improving the precision of our model as well as explaining why the model makes an unreliable prediction.

In this paper, we use a simple heuristic to identify bad alignment pairs: we first find the maximum score $s_{\max}$ over all possible alignment pairs for an example; then, for each pair $(q_i, c_{a_i})$ in the predicted alignment, we compute the gap $s_{\max} - s(q_i, c_{a_i})$ and take the largest of these as the worst alignment gap (WAG). If the WAG is beyond some threshold, it may indicate that an alignment pair is not reliable. (We look at differences from the max alignment in order to calibrate the scores based on what “typical” scores look like for that instance.)
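A minimal sketch of the WAG computation and the score-based rejection rule follows; the function names and threshold are illustrative.

```python
def worst_alignment_gap(all_pair_scores, predicted_pair_scores):
    """WAG: largest gap between the best score available on this instance
    and the scores of the pairs actually used in the predicted alignment."""
    s_max = max(all_pair_scores)
    return max(s_max - s for s in predicted_pair_scores)

def reject(all_pair_scores, predicted_pair_scores, threshold):
    """Abstain from answering when the worst gap exceeds the threshold."""
    return worst_alignment_gap(all_pair_scores, predicted_pair_scores) > threshold
```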

Figure 5: The F1-coverage curve of our model compared with BERT QA: if a model answers only the percentage of examples it is most confident about (the coverage), what F1 does it achieve? For our model, confidence is represented by our “worst alignment gap” metric; for BERT, by the posterior probability.

Comparison to BERT

Desai and Durrett (2020) show that pre-trained transformers like BERT are well calibrated across a range of tasks. Since we reject unreliable predictions to improve the precision of the model, for a fair comparison we reject the same number of examples using the posterior probability of the BERT QA predictions. Specifically, we rank the predictions of all examples by the sum of the start and end posterior probabilities and compute the F1 score on the top predictions.
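Points on an F1-coverage curve like Figure 5 can be computed as below. The confidence values would be, say, the negated WAG for our model or summed posterior probabilities for BERT; this is our illustrative framing rather than the exact evaluation script.

```python
def f1_at_coverage(confidences, f1s, coverage):
    """Answer only the top `coverage` fraction of examples by confidence
    and report mean F1 over the answered set."""
    ranked = sorted(zip(confidences, f1s), key=lambda x: -x[0])
    k = max(1, round(coverage * len(ranked)))
    kept = [f for _, f in ranked[:k]]
    return sum(kept) / len(kept)
```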

Figure 6: Examples of our model's alignments on addOneSent. The numbers are the actual alignment scores of the model's output. The dashed arrows denote unreliable alignments; the bold arrows denote the alignments that contribute most to the model's prediction.

5.1 Results on Constrained Alignment

On Adversarial SQuAD, the confidence scores of a normal BERT QA model do not align with its performance.

From Figure 5, we find that the more confident BERT is (i.e., in low-coverage settings), the worse its performance. One possible explanation of this phenomenon is that BERT overfits to the pattern of lexical overlap, and is actually most confident when highly overlapping adversarial examples show up.

Hard entity constraints improve the precision but are not flexible.

Figure 5 also shows that adding the hard entity constraint achieves a 67.5 F1 score, an 11.3-point improvement over the unconstrained model, at the cost of covering only 60% of examples. Under the hard entity constraint, the model cannot align to nodes in the adversarial sentence, but performance is still lower than on normal SQuAD. Examining some of the error cases, we find that for a certain number of examples there is no path from the node satisfying the constraint to the node containing the answer (e.g., they hold a more complex discourse relation, while we only consider coreference as a cross-sentence relation). In such cases, we can never find the answer through hard entity matching.

Smaller worst alignment gap indicates better performance.

In contrast to BERT, our alignment score is well calibrated on these adversarial examples. This substantiates our claim that the learned alignment scores are good indicators of how trustworthy alignment pairs are. Also, at the same coverage as the entity constraint, performance under the alignment score constraint is even better. This demonstrates that the score-based constraint is flexible and easy to apply, yet effective.

                      Natural Questions   NewsQA            BioASQ            TBQA
                      ans in wh   F1      ans in wh   F1    ans in wh   F1    ans in wh   F1
local t + local inf   63.1        56.5    40.0        37.8  54.9        42.4  22.7        21.3
local t + global inf  61.5        55.2    42.3        39.9  54.5        41.6  29.1        26.6
Subpart Alignment     60.8        55.0    48.3        45.1  64.2        49.4  30.8        27.8
BERT QA               —           55.4    —           48.5  —           53.4  —           25.3

Table 4: The performance of our proposed model on several out-of-domain datasets from the MRQA shared task Fisch et al. (2019). Compared to SQuAD in-domain, where our model is 6 F1 lower than BERT, global training and inference help our model achieve nearly comparable aggregate performance across different domains.

5.2 Case study on Alignment Scores

In the experiments above, we showed that alignment scores help control the behavior of our model. In this section, we give several examples of alignments and demonstrate how these scores can act as an explanation of the model's behavior. The examples are shown in Figure 6 and exhibit the following characteristics:

The model's behavior is strongly affected by certain overconfident alignments.

As shown by the dashed arrows, all adversarial alignments contain at least one unreliable alignment with a relatively low alignment score. This happens because the model is overconfident about the other alignments with high lexical overlap, shown by the bold arrows. With these scores, it is easy to interpret the model's behavior. For instance, in example (a), the predicate alignment forces Luther's 95 Theses to align to Jeff Dean, which is totally unrelated.

Predicate alignments are not as informative as argument alignments.

In these examples, the predicate alignments all receive reasonable scores, especially those with an exact match. However, many different predicates or phrases can express similar meanings, so the predicate alignment learned on normal SQuAD is not fully reliable. To further improve the quality of predicate alignment, either a more powerful training set or a separate predicate alignment module is needed.

Note that it is precisely because we have alignments over the sub-parts of a question that we can inspect the model's behavior in a way the standard BERT QA model cannot.
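To make the coverage/accuracy trade-off from these alignment scores concrete, here is a minimal sketch of a score-based rejection constraint. The function name, tuple format, and threshold value are illustrative assumptions, not the paper's actual implementation:

```python
def answer_or_abstain(alignments, threshold=0.5):
    """Return an answer only if every sub-part alignment is confident.

    alignments: list of (question_node, context_node, score) tuples
    produced by an alignment model for the predicted answer subgraph.
    """
    if not alignments:
        return None  # nothing aligned: abstain
    # The least confident alignment is the weakest link; as in the
    # adversarial examples, one bad alignment can doom the answer.
    weakest = min(score for _, _, score in alignments)
    if weakest < threshold:
        return None  # abstain rather than answer via a bad alignment
    # Otherwise, return the context node aligned to the wh-argument.
    for q_node, c_node, score in alignments:
        if q_node == "wh":
            return c_node
    return None

# An unreliable predicate alignment (score 0.2) triggers abstention,
# even though the wh-alignment itself looks confident.
print(answer_or_abstain([("wh", "Jeff Dean", 0.9), ("pred", "wrote", 0.2)]))
```

Raising `threshold` rejects more questions (lower coverage) but keeps only predictions whose entire alignment chain is confident (higher accuracy), matching the coverage/accuracy trade-off reported for the constrained model.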

6 Cross-Domain Performance

We also test performance on several cross-domain datasets from the MRQA shared task Fisch et al. (2019): Natural Questions Kwiatkowski et al. (2019), NewsQA Trischler et al. (2017), BioASQ, and TextbookQA Kembhavi et al. (2017); the results are shown in Table 4. Of particular note, although our model does worse than BERT on SQuAD (Table 1), its performance is closer to BERT's in other domains, even without the addition of any constraints. We also consistently see improvements from global training and global inference, except on Natural Questions. A main cause of the performance drop is answer extraction itself: on BioASQ, for example, we find an argument containing the answer nearly 64% of the time, but the answer extraction module fails because the answer types differ significantly from those in SQuAD. We believe this module could be further adapted to new domains.

7 Related Work

Adversarial Attacks in NLP.

Adversarial attacks on a wide range of NLP tasks have been increasingly studied in recent years. These may take the form of challenge sets like adversarial SQuAD Jia and Liang (2017) or attacks like universal adversarial triggers Wallace et al. (2019). A separate line of work enumerates a space of sentence perturbations and searches over this space adversarially: for example, Ribeiro et al. (2018) derive semantically equivalent transformation rules, Ebrahimi et al. (2018) use character-level flips, and Iyyer et al. (2018) use controlled paraphrase generation. The more highly structured nature of our approach makes it naturally more robust to such attacks.

Neural module networks.

Neural module networks are a class of models that decompose a task into several sub-tasks (sub-modules), making models more robust and interpretable Andreas et al. (2016); Hu et al. (2017); Cirik et al. (2018); Hudson and Manning (2018); Jiang and Bansal (2019). Our work models QA as a collection of alignment decisions, but it differs from module networks in that their sub-modules are typically learned end-to-end, while our alignment module is trained in a structured prediction framework, making it more flexible and controllable.

Unanswerable questions

Our approach rejects some questions as unanswerable. This is similar to the idea of unanswerable questions in SQuAD 2.0 Rajpurkar et al. (2018), which have been studied in other systems Hu et al. (2019). However, techniques to reject these questions, which are not adversarial in nature, differ substantially from ours, and the setting we consider is more challenging as we do not assume access to such questions at training time.

Graph alignment

Khashabi et al. (2018) propose answering questions through a similar graph alignment over a wide range of semantic abstractions of the text, using ILP-based inference to find the optimal graph alignment. Our model differs in two ways: (1) our alignment model is trained end-to-end, while their system mainly uses off-the-shelf, general-purpose natural language modules; (2) our alignment is formulated as node-pair alignment rather than finding an optimal sub-graph, which is significantly more flexible. Sachan et al. (2015) and Sachan and Xing (2016) use a latent alignment structure most similar to ours; however, our model is quite different from theirs, and our alignment is also more flexible.

Decomposing Questions

Past work has decomposed complex questions to answer them more effectively Talmor and Berant (2018); Min et al. (2019); Perez et al. (2020). Wolfson et al. (2020) further introduce a Question Decomposition Meaning Representation (QDMR) to explicitly model this process. However, the questions they answer, such as those from HotpotQA Yang et al. (2018), are designed from the start to be multi-part and so are easily decomposed, whereas the questions we consider are not. Our work focuses on robustness, controllability, and explainability, and our model could in principle be extended to leverage these question decomposition forms as well.


Acknowledgments

This work was partially supported by NSF Grant IIS-1814522, a gift from Arm, and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Thanks to Livio Baldini Soares and Daniel Andor for helpful comments on this work.


  • Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
  • Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, pages 178–186.
  • Chen and Durrett (2019) Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. NAACL.
  • Cirik et al. (2018) Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. 2018. Using syntax to ground referring expressions in natural images. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Desai and Durrett (2020) Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. Hotflip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36.
  • Fisch et al. (2019) Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753.
  • Hu et al. (2018) Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4099–4106. AAAI Press.
  • Hu et al. (2019) Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Ming Zhou. 2019. Read + Verify: Machine Reading Comprehension with Unanswerable Questions. In Thirty-Third AAAI Conference on Artificial Intelligence.
  • Hu et al. (2017) Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 804–813.
  • Hudson and Manning (2018) Drew A Hudson and Christopher D Manning. 2018. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067.
  • Iyyer et al. (2018) Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.
  • Jain and Wallace (2019) Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. NAACL.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.
  • Jiang and Bansal (2019) Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4464–4474.
  • Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
  • Kaushik and Lipton (2018) Divyansh Kaushik and Zachary C Lipton. 2018. How much reading does reading comprehension require? a critical investigation of popular benchmarks. EMNLP.
  • Kembhavi et al. (2017) Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4999–5007.
  • Khashabi et al. (2018) Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2018. Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  • Lee et al. (2019) Seanie Lee, Donggyu Kim, and Jangwon Park. 2019. Domain-agnostic question-answering with adversarial training. arXiv preprint arXiv:1910.09342.
  • Min et al. (2019) Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109.
  • Miyato et al. (2016) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification.
  • Mudrakarta et al. (2018) Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the Model Understand the Question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Niven and Kao (2019) Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664.
  • Palmer et al. (2005) Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Comput. Linguist., 31(1):71–106.
  • Pereyra et al. (2016) Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey Hinton. 2016. Regularizing neural networks by penalizing confident output distributions.
  • Perez et al. (2020) Ethan Perez, Patrick A. Lewis, Wen tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised Question Decomposition for Question Answering. In arXiv.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging nlp models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.
  • Sachan et al. (2015) Mrinmaya Sachan, Kumar Dubey, Eric Xing, and Matthew Richardson. 2015. Learning answer-entailing structures for machine comprehension. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 239–249.
  • Sachan and Xing (2016) Mrinmaya Sachan and Eric Xing. 2016. Machine comprehension using rich semantic representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 486–492.
  • Shi and Lin (2019) Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.
  • Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651.
  • Trischler et al. (2017) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. ACL 2017, page 191.
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Empirical Methods in Natural Language Processing.
  • Wang and Jiang (2018) Chao Wang and Hui Jiang. 2018. Explicit utilization of general knowledge in machine reading comprehension. arXiv preprint arXiv:1809.03449.
  • Wang and Bansal (2018) Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 575–581.
  • Welbl et al. (2020) Johannes Welbl, Pasquale Minervini, Max Bartolo, Pontus Stenetorp, and Sebastian Riedel. 2020. Undersensitivity in neural reading comprehension. arXiv preprint arXiv:2003.04808.
  • Wolfson et al. (2020) Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP.
  • Yang et al. (2019) Ziqing Yang, Yiming Cui, Wanxiang Che, Ting Liu, Shijin Wang, and Guoping Hu. 2019. Improving machine reading comprehension via adversarial training. arXiv preprint arXiv:1911.03614.
  • Zhou et al. (2019) Mantong Zhou, Minlie Huang, and Xiaoyan Zhu. 2019. Robust reading comprehension with linguistic constraints via posterior regularization. arXiv preprint arXiv:1911.06948.