SG-Net: Syntax-Guided Machine Reading Comprehension

For machine reading comprehension, effectively modeling the linguistic knowledge in detail-riddled and lengthy passages while getting rid of noise is essential to improving performance. In this work, we propose using syntax to guide the text modeling of both passages and questions by incorporating syntactic clues into the multi-head attention mechanism, fully fusing information from both global and attended representations. Accordingly, we present a novel syntax-guided network (SG-Net) for challenging reading comprehension tasks. Extensive experiments on popular benchmarks including SQuAD 2.0 and RACE validate the effectiveness of the proposed method with substantial improvements over fine-tuned BERT. This work empirically discloses the effectiveness of syntactic structural information for text modeling. The proposed attention mechanism also verifies the practicability of using linguistic information to guide attention learning and can be easily adapted to other tree-structured annotations.




Introduction

Recently, much progress has been made in general-purpose language modeling that can be used across a wide range of tasks [22, 23, 4]. Understanding the meaning of a sentence is a prerequisite for solving many natural language understanding (NLU) problems, such as reading comprehension based question answering [24], and this in turn requires a good representation of the meaning of a sentence.

Humans read most words superficially and pay more attention to the key ones when reading and understanding sentences [34], which is inconsistent with the way most existing models work. These models treat every word in a sentence equally, without distillation. Specifically, if the text is particularly lengthy and detail-riddled, it is quite difficult for a deep learning based model to understand, as the model suffers from noise and pays vague attention to the text components [20], let alone accurately answers questions. In contrast, extensive studies have verified that humans read sentences efficiently by taking a sequence of fixations and saccades, and prefer to re-read long texts such as passages after a quick first glance [38].

Although a variety of attention mechanisms try to weight the important parts of an input sequence, they still perform calculations for every word without explicit pruning or prior focus. For passage-based reading comprehension, the input sequence always consists of multiple sentences. Nearly all current attention methods regard the sentences as a whole, e.g. a passage, with little consideration of the linguistic structure inside each sentence. This results in processing bias caused by excessive noise and a lack of associated spans for each word of concern.

Figure 1: Example of span-based QA. Our proposed model is capable of capturing syntactic information provided by sentence parser, which is very helpful for predicting the final answer. The main head words in the sentences are shown in red and words between left and right arrows are child nodes of the head word. The DOI of each word is a set of its head (parent) words.

All these factors motivate us to seek an informative method that can selectively pick out important words and linguistically distinguish every word. Instead of taking all the words, such a method explicitly considers only the related subset of syntactically important words inside each input sentence, guided by syntactic structure clues, to give more accurate attentive signals and reduce the impact of the noise brought about by the whole passage.

In this paper, we introduce syntax to guide machine reading comprehension by building a fine-grained structured representation for each sentence to improve language modeling, and we further propose a syntax-guided network for machine reading comprehension. Specifically, we use a pre-trained dependency parser to produce the related nodes for each word in a sentence, namely its dependency of interest (DOI): regarding each word as a child node, its DOI is the set of its parent nodes in the dependency parse tree. An example is shown in Figure 1. We then propose syntax-guided self attention to incorporate the syntactic tree structure information, and explore ways of aggregation to form syntax-enhanced representations. Our evaluations are based on two widely used challenging MRC tasks, span-based SQuAD 2.0 and multi-choice style RACE.

To the best of our knowledge, we are the first to integrate syntactic relationships as attentive guidance for machine reading comprehension and to propose a general syntax-guided structure for deep aggregation. A series of experiments and analyses show that the proposed method is effective and boosts the strong BERT baseline substantially.

Related Work

Machine Reading Comprehension

In the last decade, MRC tasks have evolved from the early cloze-style tests [10, 9, 40] to span-based answer extraction from passages [25, 21, 12, 37, 24] and multi-choice style ones [15], where the latter two are our focus in this work. A wide range of attentive models have been employed, including Attention Sum Reader [13], Gated-Attention Reader [5], Self-Matching Network [36], Attention-over-Attention Reader [3] and Bi-attention Network [28]. Meanwhile, researchers are also investigating models with more complex context understanding [35, 37, 42, 41].

Recently, deep contextual language models have been shown to be effective for learning universal language representations by leveraging large amounts of unlabeled data, achieving state-of-the-art results on a series of NLU benchmarks. Some prominent examples are Embeddings from Language Models (ELMo) [22], Generative Pre-trained Transformer (OpenAI GPT) [23] and Bidirectional Encoder Representations from Transformers (BERT) [4]. Among these, BERT uses a different pre-training objective, the masked language model, which allows it to capture both left and right context. BERT also introduces a next sentence prediction task that jointly pre-trains text-pair representations. The latest evaluation shows that BERT is powerful and convenient for downstream tasks: it can either be applied to downstream models as the encoder or be used directly for fine-tuning. Following this line, we extract context-sensitive syntactic features and take pre-trained BERT as our backbone encoder, empowering it with explicit syntactic dependencies.

Figure 2: Overview of Syntax-Guided Network.

Syntactic Structures

Recently, dependency syntactic parsing has been further developed with neural networks and has attained new state-of-the-art results [2, 6, 17]. Benefiting from highly accurate parsers, neural network models can enjoy even higher accuracy gains by leveraging syntactic information rather than ignoring it [27, 7, 18].

A syntactic dependency parse tree indicates the existence and type of the linguistic dependency relations among words, which has been shown to be generally beneficial in various natural language understanding tasks [1]. To exploit syntactic clues, most previous works absorb parse tree information by transforming dependency labels into vectors and simply concatenating the label embedding with the word representation. However, such simplified and straightforward processing results in a higher dimension of the joint word and label embeddings, and it is too coarse to capture the contextual interactions between the associated labels and the mutual connections between labels and words. This inspires us to seek an attentive way to enrich the contextual representation from the syntactic source. A similar work is [29], which proposed to incorporate syntax with multi-task learning for semantic role labeling. However, their syntax is incorporated by training one extra attention head to attend to the syntactic parents of each token. Instead of only considering syntactic head nodes, we propose to fully use all the DOI parts, and we extend the syntax-guided representation by combining it with the original contextual word embedding via a bi-attention. Thus, we form a general approach to benefit from syntax-guided representations, which is, to the best of our knowledge, the first such attempt for machine reading comprehension.

Syntax-Guided Network

Our goal is to design an effective neural network model that makes use of linguistic information as effectively as possible in order to perform end-to-end MRC. We first present the general syntax-guided attentive architecture, building upon the recent advanced BERT (note that our method is not limited to BERT; we use it as the backbone because of its superior representation capacity and the strong baseline it provides), and then fit it with task-specific layers for machine reading comprehension tasks.

Figure 2 depicts the whole architecture of our model. The basis for our model is the fine-tuned BERT encoder introduced from [4]. We transform BERT embeddings into conditionally attended representation using our proposed syntax-guided self attention by only considering the dependency of interest (DOI) when calculating the weights for each token, which is intuitively a constrained and fine-grained interaction inside each sentence while the vanilla self attention is a kind of global processing by leveraging all the input tokens. Then we integrate the outputs from the global and conditional attention learning utilizing a bi-attention layer. At last, the resulting syntax-enhanced representation is passed to task-specific layers for final predictions.

BERT Encoder

Following the implementation of BERT [4], the first token of every sequence is the special token [CLS] and the sequences are separated by the [SEP] token. The output of BERT is then fed to our proposed syntax-guided attention layers to obtain the syntax-enhanced representation. We omit rather extensive formulations of BERT and recommend readers to get the details from [4].

Syntax-Guided Network

Our syntax-guided representation is obtained in two steps. (For brevity, we focus on a particular layer, though we may use multiple layers that are stacked in the same way.) First, we pass the encoded representation to a syntax-guided self attention layer and a vanilla multi-head attention layer in parallel. (Both the vanilla and the syntax-guided self attention layers are extra parts placed after the 24-layer BERT encoder. We tried applying DOIMASK inside the 24-layer BERT, but it showed only marginal improvement. We therefore keep the whole 24-layer BERT encoder as it is and add two extra parallel parts followed by an aggregating part.) The corresponding outputs are then aggregated using a bi-attention network to form a syntax-enhanced representation for downstream tasks. Specifically, the syntax-guided self attention is designed to incorporate the syntactic tree structure information inside a multi-head attention mechanism to indicate the token relationships of each sentence, as described in the following.

Syntax-Guided Self Attention Layer

In this work, we first pre-train a syntactic dependency parser to annotate the dependency structures for each sentence which are then fed to SG-Net as guidance of token-aware attention. Details of the pre-training process of the parser are reported in Section Implementation.

Figure 3: Syntax-guided self attention.

To use the relationship between a head word and its dependent words provided by the syntactic dependency tree of a sentence, we restrict the scope of attention to a word and all of its parent head words. In other words, we would like each word to attend only to the words of syntactic importance in a sentence, namely its parent head words from the viewpoint of the child word. As shown in Figure 3, instead of attending to every word in the whole passage, the word credit attends only to its parent head words reflects and losses and to itself, which means that the DOI of credit contains reflects and losses along with credit itself.

To incorporate syntax, we introduce a syntax-guided multi-head self attention layer after the BERT encoder to attend to each token’s syntactically related nodes, allowing the model to use the attention head as an oracle for syntactic dependencies.

Specifically, given an input token sequence, we first use our parser to generate the index span of each token's subtree, which indicates the span of child nodes for each head word. To utilize the relationship between dependent tokens and their parent tokens, we use these spans to build the DOI of each word in the form of a dependency of interest mask (DOIMASK), a square matrix with one row and one column per token. An entry of the mask is 1 when the token of its row lies inside the subtree span of the token of its column, so that the column token is a parent node of the row token (or the token itself), and 0 otherwise.
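The mask construction can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in our experiments: the `heads` encoding (the index of each token's parent, with -1 for the root), the function name `build_doi_mask` and the three-word toy tree are all assumptions made for this example. Row i of the resulting mask marks token i itself and every parent head reached by walking up the tree.

```python
from typing import List


def build_doi_mask(heads: List[int]) -> List[List[int]]:
    """Return an n x n 0/1 mask whose row i marks token i and all of its ancestors.

    heads[i] is the index of token i's parent, or -1 for the root
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(heads)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        j, steps = i, 0
        while j != -1:
            mask[i][j] = 1      # token i may attend to token j
            j = heads[j]        # move one level up the tree
            steps += 1
            if steps > n:       # guard against malformed (cyclic) input
                raise ValueError("heads does not describe a tree")
    return mask


# Toy tree: "reflects" is the root, "losses" depends on it,
# and "credit" depends on "losses".
tokens = ["reflects", "credit", "losses"]
heads = [-1, 2, 0]
mask = build_doi_mask(heads)
```

In this toy example the row of credit is all ones, since its DOI covers reflects, losses and credit itself, whereas the row of the root reflects contains only the diagonal entry.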

We then project the last layer output from the vanilla BERT into distinct key, value and query representations for each attention head. For each head, we perform a dot product to score key-query pairs, combined with the dependency of interest mask, to obtain attention weights with one row and one column per token. We then multiply the attention weights by the value representation to obtain the syntax-guided token representations.

The representations of all heads are then concatenated and passed through a feed-forward layer followed by GeLU activations [8]. After passing through another feed-forward layer, we apply layer normalization to the sum of the output and the initial representation to obtain the final representations.
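As an illustration, the masked attention of a single head can be sketched as follows. The sketch makes several assumptions that are not fixed by the description above: the scores are scaled by the square root of the key dimension as in standard dot-product attention, the mask is applied by excluding positions outside the DOI from the softmax (equivalent to assigning them a score of minus infinity), and the names `masked_attention`, `q`, `k` and `v` are hypothetical. The projections, the multi-head concatenation and the feed-forward layers are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def masked_attention(q: Matrix, k: Matrix, v: Matrix, mask: List[List[int]]) -> Matrix:
    """Single-head scaled dot-product attention restricted by a 0/1 mask.

    Row i of the output mixes only the value vectors v[j] with mask[i][j] == 1.
    Every row of the mask must allow at least one position (here, the token itself).
    """
    d_k = len(k[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        allowed = [j for j, m in enumerate(mask[i]) if m]
        top = max(scores[j] for j in allowed)            # subtract max for stability
        exps = {j: math.exp(scores[j] - top) for j in allowed}
        total = sum(exps.values())
        weights = [exps.get(j, 0.0) / total for j in range(len(k))]
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Toy inputs: three tokens, two dimensions, zero queries (uniform weights inside the DOI).
doi_mask = [[1, 0, 0], [1, 1, 1], [1, 0, 1]]
q = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attended = masked_attention(q, k, v, doi_mask)
```

With zero queries all allowed positions receive equal weight, so the first row reproduces the first value vector, while the other rows average the value vectors inside their DOI.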

Attention Aggregation Layer

To integrate the syntax-guided token representations, we first feed the last layer output of the vanilla BERT into a multi-head self attention layer [32] to obtain self-attended token representations, and then utilize a bi-attention network [28] to update these representations with the syntax-guided representations.


Task-specific Adaptation

We focus on two types of reading comprehension tasks, span-based and multi-choice style. An instance of the former can be described as a tuple of a passage (context), a query over the contents of the passage and an answer, where a span of the passage is the right answer. An instance of the latter additionally contains a set of candidate choices, one of which is the right answer. For the span-based type, we implement our model on the SQuAD 2.0 task, which contains unanswerable questions. Our system is supposed not only to predict the start and end positions in the passage and extract the span as the answer, but also to return a null string when the question is unanswerable. For the multi-choice style, the model is implemented on the RACE dataset, where it is required to choose the right answer from a set of candidates according to the given passage and question.

Here, we formulate our model for both of the two tasks and feed the output from the syntax-guided network to task layers according to specific task.

Given the passage and a question and the choice specially for RACE, we organize the input for BERT as the following two sequences.

where denotes concatenation.

The sequence is fed to the BERT encoder mentioned above to obtain the contextualized representation H, which is then passed to our proposed syntax-guided self attention layer described in Syntax-Guided Network to obtain the syntax-enhanced representation. For the span-based task, we feed this representation to a Pointer Network and a no-answer (NA) verifier. For the multi-choice task, we feed it into a classifier to predict the choice label.

SQuAD 2.0

To enable the fine-tuned BERT to work competently on SQuAD 2.0, we extend BERT by adding a Pointer Network and an NA verifier for no-answer predictions. Our target is a span of answer text, and the prediction of the end position depends on the start position. We thus employ a Pointer Network [33] and feed it the syntax-enhanced representation as input to obtain the start and end probabilities, similar to the usage in various former MRC models such as R-Net [36].


For the SQuAD 2.0 challenge, our model must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering. We therefore also need a tactic for giving no-answer predictions.

We pass the start and end outputs of the Pointer Network to a softmax layer and employ batch matrix multiplication with the token representations, respectively, to obtain weighted hidden representations. Meanwhile, we apply first-token pooling to the token representations. (The target answer for an unanswerable question is set to the first token "[CLS]" during training, so we use the first token representation to make the following no-answer prediction.) Then we sum the two weighted representations, concatenate the result with the pooled representation, and feed it to a linear layer to predict whether the question is answerable or not.

Model                                              EM (Test)   F1 (Test)
Regular Track
BiDAF-No-Answer [24]                               59.2        62.1
DocQA [24]                                         59.3        62.3
DocQA + ELMo [24]                                  63.4        66.3
Joint SAN [16]                                     68.7        71.4
U-Net [30]                                         69.2        72.6
RMR + ELMo + Verifier [11]                         71.7        74.2
SLQA+                                              71.5        74.4
BERT Track
Human                                              86.8        89.5
BERT + DAE + AoA                                   85.9        88.6
BERT + N-Gram Masking + Synthetic Self-Training    85.2        87.7
BERT + ConvLSTM + MTL + Verifier                   84.9        88.2
SemBERT                                            84.8        87.9
Insight-baseline-BERT                              84.8        87.6
BERT + MMFT + ADA                                  83.0        85.9
BERT                                               82.1        84.8
SG-Net                                             85.2        87.9
Table 1: Exact Match (EM) and F1 scores (%) on SQuAD 2.0 dataset for single models. Our model is in boldface. refers to unpublished work. Besides published works, we also list competing systems on the SQuAD leaderboard.

The training objective of our SQuAD model is a weighted sum of two parts: the cross entropy loss for the start and end predictions and the binary cross entropy loss for the no-answer prediction, where two hyper-parameters weight the two loss functions. The targets of the first loss are the gold start and end positions, and the target of the second loss is the ground truth label of the no-answer prediction.
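The combined objective can be illustrated with a short sketch. The function and argument names, the default weights of 0.5 and the choice to add (rather than average) the start and end terms are assumptions made for this example, not values taken from our experiments.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])


def binary_cross_entropy(prob: float, label: int) -> float:
    """Binary cross entropy for one prediction; prob must lie strictly in (0, 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def squad_loss(start_probs: List[float], end_probs: List[float], na_prob: float,
               start_target: int, end_target: int, na_label: int,
               alpha_span: float = 0.5, alpha_na: float = 0.5) -> float:
    """Weighted sum of the span loss and the no-answer loss.

    alpha_span and alpha_na are hypothetical names for the two loss weights.
    """
    span = cross_entropy(start_probs, start_target) + cross_entropy(end_probs, end_target)
    na = binary_cross_entropy(na_prob, na_label)
    return alpha_span * span + alpha_na * na


loss = squad_loss([0.5, 0.5], [0.25, 0.75], 0.5,
                  start_target=0, end_target=1, na_label=0,
                  alpha_span=1.0, alpha_na=1.0)
```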

For prediction, given the output start and end probabilities and the no-answer probability, we calculate a has-answer score and a no-answer score, which are combined using weighting coefficients. We take the difference between the has-answer score and the no-answer score as the final score. A threshold is set to determine whether the question is answerable, which can be heuristically computed in linear time with dynamic programming. The model predicts the answer span that gives the has-answer score if the final score is above the threshold, and the null string otherwise.
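The decision rule can be sketched as follows. The concrete form of the two scores is an assumption made for illustration: the has-answer score is taken as the best sum of a start and an end probability over spans that exclude position 0 (the "[CLS]" token), and the no-answer score as a weighted sum of the first-token probabilities and the verifier probability. The names `predict_span`, `lam1`, `lam2` and `threshold`, and their default values, are hypothetical.

```python
from typing import List, Optional, Tuple


def predict_span(start: List[float], end: List[float], na_prob: float,
                 lam1: float = 1.0, lam2: float = 1.0,
                 threshold: float = 0.0) -> Optional[Tuple[int, int]]:
    """Return the best (start, end) span, or None when the question is judged unanswerable.

    Position 0 is assumed to be [CLS]; answer spans are searched over positions 1..n-1.
    """
    n = len(start)
    best_score, best_span = float("-inf"), None
    for k in range(1, n):
        for l in range(k, n):                  # the end may not precede the start
            score = start[k] + end[l]
            if score > best_score:
                best_score, best_span = score, (k, l)
    na_score = lam1 * (start[0] + end[0]) + lam2 * na_prob
    final_score = best_score - na_score        # has-answer score minus no-answer score
    return best_span if final_score > threshold else None


answerable = predict_span([0.1, 0.7, 0.2], [0.1, 0.2, 0.7], na_prob=0.1)
unanswerable = predict_span([0.8, 0.1, 0.1], [0.8, 0.1, 0.1], na_prob=0.9)
```

The exhaustive search over spans is quadratic in the sequence length and is used here only for clarity.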


RACE

As discussed in [4], the pooled representation explicitly includes classification information during the pre-training stage of BERT, so we expect the pooled representation to be an overall representation of the input. Thus, the first token representation of the syntax-enhanced output is picked out and passed to a feed-forward layer to give the prediction. For each instance with several choice candidates, we update the model parameters according to the cross-entropy loss during training and choose the candidate with the highest probability as the prediction when testing. The training objective of our RACE model is the cross-entropy between the predictions and the targets over all training instances.
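A minimal sketch of the multi-choice objective for a single instance is given below. It assumes that the classifier produces one logit per candidate choice and that the logits are normalized with a softmax; the function names and the example logits are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choice_loss_and_prediction(choice_logits: List[float], target: int) -> Tuple[float, int]:
    """Cross-entropy loss for one instance and the index of the most probable choice."""
    probs = softmax(choice_logits)
    loss = -math.log(probs[target])
    prediction = max(range(len(probs)), key=probs.__getitem__)
    return loss, prediction


# Four candidate choices, as in RACE; the logits are made up for the example.
loss, prediction = choice_loss_and_prediction([1.0, 3.0, 0.5, 0.2], target=1)
```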


Dataset and Setup

Span-based MRC

As a widely used MRC benchmark dataset, SQuAD 2.0 [24] combines the 100,000 questions in SQuAD 1.1 [25] with over 50,000 new, unanswerable questions that are written adversarially by crowdworkers to look similar to answerable ones. For the SQuAD 2.0 challenge, systems must not only answer questions when possible, but also abstain from answering when no answer is supported by the paragraph. Two official metrics are selected to evaluate the model performance: Exact Match (EM) and a softer metric, the F1 score, which is the harmonic mean of precision and recall computed at the token level.
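The two metrics can be illustrated with a simplified sketch. It follows the convention of the official evaluation script in comparing whitespace-separated tokens and in scoring an empty prediction as correct only when the gold answer is also empty, but it reduces answer normalization to lowercasing; the official script additionally strips punctuation and articles. The function names are chosen for this example.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the two answers are identical after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: both must be empty to count as correct.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Missouri Compromise", "Missouri Compromise")
f1 = f1_score("the Missouri Compromise", "Missouri Compromise")
```

In this example the prediction contains one extra token, so the exact match is 0 while precision is 2/3 and recall is 1.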

Multi-choice MRC

Our multi-choice MRC model is evaluated on the Large-scale ReAding Comprehension Dataset From Examinations (RACE) [15], which consists of two subsets, RACE-M and RACE-H, corresponding to middle school and high school difficulty levels. RACE contains 27,933 passages and 97,687 questions in total and is recognized as one of the largest and most difficult datasets in multi-choice MRC. The official evaluation metric is accuracy.


Implementation

For the syntactic parser, we follow the deep biaffine attention based dependency parser from [6], using the English Penn Treebank (PTB) dataset [19], to annotate our task datasets. We re-train the dependency parser by joint learning with constituent parsing [14], using BERT as the sole input, which achieves very high accuracy: 97.00% UAS and 95.43% LAS on the PTB test set (we report the unlabeled and labeled attachment scores, UAS and LAS, without punctuation). Note that this step is done in data preprocessing and that our parser is not updated with the following MRC models.

After generating the dependency syntactic tree for each sentence, we use a simple depth-first search algorithm to calculate the left and right boundaries of each sub-tree as the index span of each node’s children. Moreover, we convert the index span to attention mask which is convenient for range restriction of self attention layers.
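The two preprocessing steps can be sketched as follows. The sketch assumes that a tree is given as a list of parent indices with -1 marking a single root, and the function names `subtree_spans` and `spans_to_mask` are hypothetical. Note that a span of boundaries coincides with the actual subtree only for projective trees; for non-projective trees the span may also cover tokens outside the subtree.

```python
from typing import List, Tuple


def subtree_spans(heads: List[int]) -> List[Tuple[int, int]]:
    """Left and right boundary of each node's subtree, computed by depth-first search.

    heads[i] is the parent index of token i, or -1 for the (single) root.
    """
    n = len(heads)
    children: List[List[int]] = [[] for _ in range(n)]
    root = -1
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)
    if root == -1:
        raise ValueError("no root found")
    spans = [(i, i) for i in range(n)]

    def dfs(node: int) -> Tuple[int, int]:
        left = right = node
        for child in children[node]:
            c_left, c_right = dfs(child)
            left, right = min(left, c_left), max(right, c_right)
        spans[node] = (left, right)
        return left, right

    dfs(root)
    return spans


def spans_to_mask(spans: List[Tuple[int, int]]) -> List[List[int]]:
    """mask[i][j] is 1 when token i lies inside the subtree span of token j."""
    n = len(spans)
    return [[1 if spans[j][0] <= i <= spans[j][1] else 0 for j in range(n)]
            for i in range(n)]


# Toy tree: token 0 is the root, token 2 depends on it, token 1 depends on token 2.
spans = subtree_spans([-1, 2, 0])
span_mask = spans_to_mask(spans)
```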

Model        RACE-M   RACE-H   RACE
Human Performance
Turkers      85.1     69.4     73.3
Ceiling      95.4     94.2     94.5
Single Models
DCMN         77.6     70.1     72.3
BERT         76.6     70.1     72.0
OCN          76.7     69.6     71.7
RSM          69.2     61.5     63.8
GPT          62.9     57.4     59.0
SG-Net       78.8     72.2     74.2
Table 2: Accuracy (%) on RACE test set for single models.

For the MRC model implementation, we follow the same fine-tuning procedure as BERT to avoid extra influence and to focus on the intrinsic performance of our newly proposed method. We use the pre-trained cased BERT (whole word masking) as the backbone. We use Adam as our optimizer with an initial learning rate in {8e-6, 1e-5, 2e-5, 3e-5}, a warm-up rate of 0.1 and an L2 weight decay of 0.01. The batch size is selected from {16, 20, 32}. The maximum number of epochs is set to 3 or 10 depending on the task. All texts are tokenized into wordpieces, and the maximum length is set to 384 for both SQuAD and RACE. The configuration for multi-head self attention is the same as that for BERT. For SQuAD, coefficients are set as

, , . .
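For reference, a warm-up rate of 0.1 can be read as a learning rate schedule like the one sketched below. The linear decay after warm-up is an assumption based on the common BERT fine-tuning recipe, and the function name, the default peak rate of 2e-5 (one of the candidate values above) and the step counts are chosen only for illustration.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-5, warmup_rate: float = 0.1) -> float:
    """Linear warm-up to peak_lr over the first warmup_rate of training, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_rate)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
```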

Main Results

To focus on the evaluation of syntactic advance and keep simplicity, we only compare with single models instead of ensemble ones.

Reading Comprehension

Table 1 shows the results on SQuAD 2.0. Various state-of-the-art models from the official leaderboard are also listed for reference. We can see that the performance of BERT is very strong. However, our model is more powerful, boosting the BERT baseline substantially with an increase of 3.1 points in both EM and F1 on the test set. It also outperforms all the published works and achieves performance comparable to that of more sophisticated mechanisms and pipeline systems.

For RACE, we compare our model with the following latest baselines: Dual Co-Matching Network (DCMN) [39], Option Comparison Network (OCN) [26], Reading Strategies Model (RSM) [31] and Generative Pre-Training (GPT) [23]. Table 2 shows the results. (The concatenation order of our input segments is slightly different from that of the original BERT, which yields about 0.5%-1% accuracy improvement. The result of our BERT baseline is therefore higher than the public one on the leaderboard, and we use our improved BERT implementation as the stronger baseline for our evaluation.) Turkers is the performance of Amazon Turkers on a random subset of the RACE test set. Ceiling is the percentage of unambiguous questions in the test set. From the comparison, we can observe that our model outperforms all baselines and achieves new state-of-the-art accuracy, which verifies the effectiveness of our proposed syntax enhancement.

Model EM F1
Our model 85.1 87.9
-Syntax-Guided Network 84.1 86.8
Concatenation 84.5 87.6
Weighted Sum 84.8 87.7
Table 3: Performance of different aggregation methods on SQuAD 2.0 dev set.

Passage: …30 minutes, the line of the Missouri Compromise…
Question: What was the name of the legislation passed in 1850?
Answer: the Missouri Compromise

Figure 4: Visualization of the vanilla self attention (left) and syntax-guided self attention (right). Weights of attention are selected from first head of the last attention layer. For the syntax-guided self attention, the columns with weights represent the DOI for each word in the row. For example, the DOI of passed contains {name, of, legislation, passed}. The weights are normalized by SoftMax to sum up to 1 for each row.

Model Analysis

Syntax-Guided Attention

Table 3 shows a comparative study that evaluates the attention mechanism in our method. We observe that the performance drops markedly without our proposed syntax-guided network (by 1.0 EM and 1.1 F1). In SG-Net, we integrate the representations from the syntax-guided attention layer and the vanilla self attention layer using a bi-attention. There are also simpler operations for representation aggregation, such as concatenation and weighted sum. We include them in our comparison, which shows that bi-attention yields the best final results.

Figure 5: Case study of the number of attention layers on SQuAD 2.0 dev set.


To gain insight into how syntax-guided attention works, we draw the attention distributions of the vanilla self attention and our proposed syntax-guided self attention, as shown in Figure 4. (Since special symbols such as [PAD] and [CLS] are not considered in the dependency parse tree, in our implementation the children of these tokens are confined to the tokens themselves. In our syntax-guided layer, these special tokens therefore have a weight of 1 over themselves; these weights do not matter, as we mask paddings in the following aggregation layer.) With the guidance of syntax, the keywords name, legislation and 1850 in the question are highlighted, and (the) Missouri and Compromise in the passage also receive great attention, which is exactly the right answer. The visualization verifies that our model is effective at selecting the vital parts, guiding the downstream layer to collect more relevant pieces to make predictions.

Attention Layers

Attention layer depth might be a factor that contributes to model performance. To investigate its influence, we conduct a case study on the number of syntax-guided self attention layers, ranging from 1 to 5. The comparison is shown in Figure 5. We observe that 3-layer attention performs best and that further increasing the model depth is harmful. This shows that a moderate depth of SG attention layers is helpful.


This paper presents a novel syntax-guided framework for machine reading comprehension. Experiments on two major machine reading comprehension benchmarks involving span-based answer extraction (SQuAD 2.0) and multi-choice inference (RACE) show that our model yields new state-of-the-art or comparable results on both of these extremely challenging tasks. This work empirically discloses the effectiveness of syntactic structural information for text modeling. The proposed attention mechanism also verifies the practicability of using linguistic information to guide attention learning and can be easily adapted to other tree-structured annotations.


  • [1] S. R. Bowman, J. Gauthier, A. Rastogi, R. Gupta, C. D. Manning, and C. Potts (2016) A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021. Cited by: Syntactic Structures.
  • [2] D. Chen and C. Manning (2014) A Fast and Accurate Dependency Parser using Neural Networks. In

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    pp. 740–750. Cited by: Syntactic Structures.
  • [3] Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu (2017) Attention-over-attention neural networks for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 1832–1846. Cited by: Machine Reading Comprehension.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Introduction, Machine Reading Comprehension, BERT Encoder, RACE, Syntax-Guided Network.
  • [5] B. Dhingra, H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov (2017) Gated-attention readers for text comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 1832–1846. Cited by: Machine Reading Comprehension.
  • [6] T. Dozat and C. D. Manning (2016) Deep biaffine attention for neural dependency parsing. ICLR. Cited by: Syntactic Structures, Implementation.
  • [7] L. He, K. Lee, M. Lewis, L. Zettlemoyer, L. He, K. Lee, M. Lewis, L. Zettlemoyer, L. He, and K. Lee (2017) Deep semantic role labeling: what works and what’s next. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 473–483. Cited by: Syntactic Structures.
  • [8] D. Hendrycks and K. Gimpel (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. ICLR. Cited by: Syntax-Guided Self Attention Layer.
  • [9] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. Advances in Neural Information Processing Systems (NIPS 2015), pp. 1693–1701. Cited by: Machine Reading Comprehension.
  • [10] F. Hill, A. Bordes, S. Chopra, and J. Weston (2015) The goldilocks principle: reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301. Cited by: Machine Reading Comprehension.
  • [11] M. Hu, Y. Peng, Z. Huang, N. Yang, M. Zhou, et al. (2018) Read+ verify: machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1808.05759. Cited by: Table 1.
  • [12] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 1601–1611. Cited by: Machine Reading Comprehension.
  • [13] R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst (2016) Text understanding with the attention sum reader network. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pp. 908–918. Cited by: Machine Reading Comprehension.
  • [14] N. Kitaev and D. Klein (2018) Constituency Parsing with a Self-Attentive Encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2676–2686. Cited by: Implementation.
  • [15] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794. Cited by: Machine Reading Comprehension, Multi-choice MRC.
  • [16] X. Liu, W. Li, Y. Fang, A. Kim, K. Duh, and J. Gao (2018) Stochastic answer networks for squad 2.0. arXiv preprint arXiv:1809.09194. Cited by: Table 1.
  • [17] X. Ma, Z. Hu, J. Liu, N. Peng, G. Neubig, and E. Hovy (2018) Stack-Pointer Networks for Dependency Parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1403–1414. Cited by: Syntactic Structures.
  • [18] D. Marcheggiani and I. Titov (2017) Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826. Cited by: Syntactic Structures.
  • [19] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19 (2). Cited by: Implementation.
  • [20] P. K. Mudrakarta, A. Taly, M. Sundararajan, and K. Dhamdhere (2018) Did the model understand the question?. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 1896–1906. Cited by: Introduction.
  • [21] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268. Cited by: Machine Reading Comprehension.
  • [22] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 2227–2237. Cited by: Introduction, Machine Reading Comprehension.
  • [23] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report. Cited by: Introduction, Machine Reading Comprehension, Reading Comprehension.
  • [24] P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 784–789. Cited by: Introduction, Machine Reading Comprehension, Table 1, Span-based MRC.
  • [25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), pp. 2383–2392. Cited by: Machine Reading Comprehension, Span-based MRC.
  • [26] Q. Ran, P. Li, W. Hu, and J. Zhou (2019) Option comparison network for multiple-choice reading comprehension. arXiv preprint arXiv:1903.03033. Cited by: Reading Comprehension.
  • [27] M. Roth and M. Lapata (2016) Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1192–1202. Cited by: Syntactic Structures.
  • [28] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2016) Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603. Cited by: Machine Reading Comprehension, Attention Aggregation Layer.
  • [29] E. Strubell, P. Verga, D. Andor, D. Weiss, and A. McCallum (2018) Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5027–5038. Cited by: Syntactic Structures.
  • [30] F. Sun, L. Li, X. Qiu, and Y. Liu (2018) U-Net: machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638. Cited by: Table 1.
  • [31] K. Sun, D. Yu, D. Yu, and C. Cardie (2018) Improving machine reading comprehension with general reading strategies. arXiv preprint arXiv:1810.13441. Cited by: Reading Comprehension.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Attention Aggregation Layer.
  • [33] O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems (NIPS 2015), pp. 2692–2700. Cited by: SQuAD 2.0.
  • [34] S. Wang, J. Zhang, and C. Zong (2017) Learning sentence representation with guidance of human attention. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), pp. 4137–4143. Cited by: Introduction.
  • [35] W. Wang, M. Yan, and C. Wu (2018) Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 1705–1714. Cited by: Machine Reading Comprehension.
  • [36] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou (2017) Gated self-matching networks for reading comprehension and question answering. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 189–198. Cited by: Machine Reading Comprehension, SQuAD 2.0.
  • [37] Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang (2018) Multi-passage machine reading comprehension with cross-passage answer verification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 1918–1927. Cited by: Machine Reading Comprehension.
  • [38] A. W. Yu, H. Lee, and Q. Le (2017) Learning to skim text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 1880–1890. Cited by: Introduction.
  • [39] S. Zhang, H. Zhao, Y. Wu, Z. Zhang, X. Zhou, and X. Zhou (2019) Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381. Cited by: Reading Comprehension.
  • [40] Z. Zhang, Y. Huang, and H. Zhao (2018) Subword-augmented embedding for cloze reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), pp. 1802–1814. Cited by: Machine Reading Comprehension.
  • [41] Z. Zhang, J. Li, P. Zhu, and H. Zhao (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Cited by: Machine Reading Comprehension.
  • [42] Z. Zhang and H. Zhao (2018) One-shot learning for question-answering in Gaokao history challenge. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Cited by: Machine Reading Comprehension.