Machine reading comprehension (MRC) refers to the task of finding answers to given questions by reading and understanding some documents. It represents a challenging benchmark task in natural language understanding (NLU). With the progress of large-scale pre-trained language models devlin2018bert, state-of-the-art MRC models ju2019technical; yang2019xlnet; lan2019albert; zhang2019sg; liu2019roberta have already surpassed human-level performance on certain commonly used MRC benchmark datasets, such as SQuAD 1.1 rajpurkar2016squad, SQuAD 2.0 rajpurkar2018know, and CoQA reddy2019coqa.
Recently, a new benchmark MRC dataset called Natural Questions (NQ) kwiatkowski2019natural has presented a substantially greater challenge for the existing MRC models (NQ provides some visual examples of the data at https://ai.google.com/research/NaturalQuestions/visualization). Specifically, there are two main challenges in NQ compared to previous MRC datasets like SQuAD 2.0. Firstly, instead of providing one relatively short paragraph for each question-answer (QA) pair, NQ gives an entire Wikipedia page, which is significantly longer than the documents in other datasets. Secondly, the NQ task not only requires the model to find an answer span (called the short answer) to the question, as in previous MRC tasks, but also asks the model to find a paragraph that contains the information required to answer the question (called the long answer).
In this paper, we focus on the NQ task and propose a new MRC model called RikiNet tailored to its associated challenges, which Reads the Wikipedia pages for natural question answering. For the first challenge of the NQ task mentioned above, RikiNet employs the proposed Dynamic Paragraph Dual-Attention (DPDA) reader which contains multiple DPDA blocks. In each DPDA block, we iteratively perform dual-attention to represent documents and questions, and employ paragraph self-attention with dynamic attention mask to fuse key tokens in each paragraph. The resulting context-aware question representation, question-aware token-level, and paragraph-level representations are fed into the predictor to obtain the answer. The motivations of designing DPDA reader are: (a) Although the entire Wikipedia page contains a large amount of text, one key observation is that most answers are only related to a few words in one paragraph; (b) The final paragraph representation can be used naturally for predicting long answers. We describe the details of DPDA reader in § 3.1.
For the second challenge, unlike prior works on the NQ dataset alberti2019bert; pan2019frustratingly that only predict the short answer and directly select its paragraph as the long answer, RikiNet employs a multi-level cascaded answer predictor which jointly predicts the short answer span, the long answer paragraph, and the answer type in a cascaded manner. Another key intuition motivating our design is that even if the relevant documents are not given, humans can easily judge that some questions have no short answers benjamin2019. Take this question as a motivating example: “What is the origin of the Nobel prize?” The answer should be based on a long story, which cannot be easily expressed in a short span of entities. Therefore we also feed the question representation into the predictor as an auxiliary prior for answer type prediction. The details will be given in § 3.2.
On the NQ test set, our single model obtains 74.3 F1 scores on the long-answer task (LA) and 57.9 F1 scores on the short-answer task (SA) compared to the published best single model alberti2019synthetic results of 66.8 F1 on LA and 53.9 F1 on SA. To the best of our knowledge, RikiNet is the first single model that outperforms the single human performance kwiatkowski2019natural on both LA and SA. Besides, our ensemble model obtains 76.1 F1 on LA and 61.3 F1 on SA, which achieves the best performance of both LA and SA on the official NQ leaderboard.
Before we describe our model in detail, we first introduce the notations and problem formalization. Our paper considers the following NQ kwiatkowski2019natural task: given a natural question and a related Wikipedia page (in the top 5 search results returned by the Google search engine), the model outputs a paragraph within the Wikipedia page as the long answer, which contains enough information to infer the answer to the question, and an entity span within the long answer that answers the question as the short answer. Also, for 1% of the Wikipedia pages, the short answer is “yes” or “no” instead of a short span. Both long answers and short answers can be NULL (i.e., no such answer could be found).
Given a natural question and its paired Wikipedia page , we tokenize them with the 30,522 wordpiece vocabulary as used in devlin2018bert. Following alberti2019bert; pan2019frustratingly, we generate multiple document spans by splitting the Wikipedia page with a sliding window. Then, we obtain multiple 6-tuple training instances for each NQ data pair , where and are wordpiece IDs of question with length and document span with length , indicates the paragraph index of the long answer where is the set that includes all paragraph indexes (i.e, all long answer candidates) within , are inclusive indices pointing to the start and end of the short answer span, and represents the five answer types, corresponding to the labels “NULL” (no answer), “SHORT” (has short answer), “LONG” (only has long answer), “YES”, and “NO”.
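The sliding-window split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `split_into_spans` is hypothetical, the defaults of 512 and 192 are the window size and stride reported in the implementation details, and the sketch assumes the window applies to document tokens only (whether question and special tokens count toward the window is not stated here).

```python
def split_into_spans(doc_tokens, window=512, stride=192):
    """Split a token list into overlapping spans with a sliding window.

    Returns a list of (start_offset, tokens) pairs. Assumption: the window
    covers document tokens only.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while True:
        end = min(start + window, len(doc_tokens))
        spans.append((start, doc_tokens[start:end]))
        if end == len(doc_tokens):
            break  # the last window reached the end of the page
        start += stride
    return spans


page = list(range(1000))  # stand-in for 1,000 wordpiece IDs
spans = split_into_spans(page)
print([start for start, _ in spans])
```

Keeping the start offset with each span makes it possible to map span-level predictions back to positions in the full page when the per-span results are merged.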
For each tuple of the data pair , RikiNet takes and as inputs, and jointly predicts . Finally we merge the prediction results of every tuple to obtain the final predicted long answer, short answer, and their confidence scores of the data pair for evaluation.
We propose the RikiNet which Reads the Wikipedia pages for natural question answering. As shown in Fig. 1, RikiNet consists of two modules: (a) the dynamic paragraph dual-attention reader as described in §3.1, and (b) the multi-level cascaded answer predictor as described in §3.2.
3.1 Dynamic Paragraph Dual-Attention Reader
Dynamic Paragraph Dual-Attention (DPDA) reader aims to represent the document span and the question . It outputs the context-aware question representation, question-aware token-level document representation, and paragraph-level document representation, which will be all fed into the predictor to obtain the long and short answers.
3.1.1 Encoding Question and Document Span
We firstly employ a pre-trained language model such as BERT devlin2018bert to obtain the initial question representation and the initial document span representation , where is the hidden size. Similar to devlin2018bert, we concatenate a “[CLS]” token, the tokenized question with length , a “[SEP]” token, the tokenized document span with length , and a final “[SEP]” token. Then we feed the resulting sequence into the pre-trained language model.
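As a small illustration of the input layout described above, the sketch below concatenates the special tokens with the question and the document span and records where each part sits. The helper name `build_input` and the returned slices are assumptions for illustration; segment IDs and the encoder call itself are omitted.

```python
def build_input(question_tokens, span_tokens):
    """Lay out [CLS] question [SEP] document-span [SEP] and locate both parts."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + span_tokens + ["[SEP]"]
    # The question starts right after [CLS].
    q_slice = slice(1, 1 + len(question_tokens))
    # The document span starts right after the first [SEP].
    d_start = q_slice.stop + 1
    d_slice = slice(d_start, d_start + len(span_tokens))
    return tokens, q_slice, d_slice


tokens, q_slice, d_slice = build_input(["who", "won"], ["the", "cup", "final"])
print(tokens)
print(tokens[q_slice], tokens[d_slice])
```

The two slices are what one would use to cut the encoder output into the initial question representation and the initial document span representation.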
3.1.2 Dynamic Paragraph Dual-Attention Block
As shown on the left in Fig. 1, DPDA reader contains multiple Dynamic Paragraph Dual-Attention (DPDA) blocks. The first block takes and as the inputs. The outputs and of the -th block are then fed into the next block. Each block contains three types of layers: the dual-attention layer, the paragraph dynamic self-attention layer, and the question self-attention layer. The last DPDA block outputs the final question and document representations. We describe them in detail now.
To strengthen the information fusion from the question to the paragraphs as well as from the paragraphs to the question, we adapt a dual-attention mechanism, which has been shown effective in other MRC models xiong2017dcn+; seo2016bidirectional; xiong2016dynamic. We further tweak it by increasing the depth of attention followed by a residual connection he2016deep and layer normalization ba2016layer.
In particular, the -th block first calculates a similarity metric which is then normalized row-wise and column-wise to produce two attention weights: , across the document for each token in the question; and , across the question for each token in the document,
Similar to xiong2016dynamic; seo2016bidirectional, we obtain the question-aware representation of the document by
where denotes concatenation. We also obtain the context-aware question representation in a dual way:
We finally apply the residual connection and layer normalization to both the question and the document representations with the linear transformations.
where and are trainable parameters in the dual-attention layer of the -th block. The document representation will be fed into the paragraph dynamic self-attention layer to obtain the paragraph representation. The question representation will be fed into the question self-attention layer to get the question embedding.
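The row-wise and column-wise normalization at the core of the dual-attention layer can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the layer itself: the similarity is taken to be a dot product (the text says only "a similarity metric"), all function names are hypothetical, and the concatenation, linear transformations, residual connection, and layer normalization are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(doc, question):
    """Assumed dot-product similarity: one row per document token."""
    return [[sum(a * b for a, b in zip(d, q)) for q in question] for d in doc]


def dual_attention_weights(doc, question):
    sim = similarity(doc, question)
    # Across the question for each document token: normalize every row.
    over_question = [softmax(row) for row in sim]
    # Across the document for each question token: normalize every column.
    over_document = [softmax(list(col)) for col in zip(*sim)]
    return over_question, over_document


def attend(weights, values):
    """Weighted sum of value vectors, one output row per weight row."""
    dim = len(values[0])
    return [[sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
            for row in weights]


doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 document tokens
question = [[1.0, 0.0], [0.0, 2.0]]           # 2 question tokens
over_question, over_document = dual_attention_weights(doc, question)
question_aware_doc = attend(over_question, question)   # one vector per document token
context_aware_question = attend(over_document, doc)    # one vector per question token
```

Both attention maps come from the same similarity matrix, which is what makes the question-to-document and document-to-question views consistent with each other.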
Question Self-Attention Layer
This layer uses a transformer self-attention block vaswani2017attention to further enrich the question representation:
where the transformer block consists of two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward layer. Each sub-layer is placed inside a residual connection with layer normalization. After the last DPDA block, we obtain the final question embedding by applying the mean pooling,
where denotes the number of the DPDA blocks. This question embedding will be further fed into the predictor for answer type prediction.
Paragraph Dynamic Self-Attention Layer
This layer is responsible for gathering information on the key tokens in each paragraph. The token-level representation is first given by:
The difference from the original multi-head self-attention in vaswani2017attention is that we incorporate two extra attention masks, which will be introduced later in Eq. (3) and (4). The last DPDA block applies a mean pooling to the tokens within the same paragraph to obtain the paragraph representation as
where denotes the number of paragraph within the document span (i.e., the number of long answer candidates within the document span ), is the representation of the -th paragraph, is the representation of the -th token at last DPDA block, and indicates the index number of the paragraph where the -th token is located.
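The paragraph-level mean pooling can be illustrated with a short sketch. The function name and the list-based representation are assumptions for illustration; `paragraph_ids[i]` plays the role of the index of the paragraph in which the i-th token is located.

```python
def paragraph_mean_pool(token_vectors, paragraph_ids, num_paragraphs):
    """Average the token vectors that belong to the same paragraph."""
    dim = len(token_vectors[0]) if token_vectors else 0
    sums = [[0.0] * dim for _ in range(num_paragraphs)]
    counts = [0] * num_paragraphs
    for vec, pid in zip(token_vectors, paragraph_ids):
        counts[pid] += 1
        for j, value in enumerate(vec):
            sums[pid][j] += value
    # Paragraphs with no tokens in this span keep a zero vector.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


tokens = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
print(paragraph_mean_pool(tokens, [0, 0, 1], 2))  # [[2.0, 3.0], [10.0, 20.0]]
```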
Tokens in the original multi-head attention layer of the transformer self-attention block attend to all tokens. We introduce two attention masks to the self-attention sub-layer in Eq. (1) based on two key motivations: 1) Each paragraph representation should focus on the question-aware token information inside the paragraph; 2) Most of the answers are only related to a few words in a paragraph. For the first motivation, we introduce the paragraph attention mask which is defined as:
It forces each token to only attend to the tokens within the same paragraph. Therefore, each paragraph representation focuses on its internal token information after the mean pooling of Eq. (2).
Based on the second motivation, we dynamically generate another attention mask to select key tokens before self-attention. We use a neural network called the scorer, with an activation function, to calculate the importance score for each token:
Then we obtain the dynamic attention mask by selecting the top-scoring tokens (following zhuang2019token, our implementation pads the unselected token representations with zero embeddings and adds the scorer representation, after a linear transformation, to avoid gradient vanishing for scorer training):
Here each token receives a score at each block, the number of selected tokens is a hyperparameter, and the selected set contains the indices of the top-scoring tokens. This attention mask lets the paragraph representation concentrate on the selected key tokens.
The final scaled dot-product attention weight of the multi-head self-attention sub-layer vaswani2017attention in Eq. (1) with two proposed attention masks can be written as:
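To make the effect of the two masks concrete, the sketch below builds both and applies them inside a softmax. It rests on several assumptions that go beyond the text: masks are additive with 0 for allowed positions and negative infinity for blocked ones, the dynamic mask restricts which tokens may be attended to (keys) rather than which tokens attend, and the names `paragraph_mask`, `dynamic_mask`, and `masked_attention` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def paragraph_mask(paragraph_ids):
    """Allow attention only between tokens of the same paragraph."""
    n = len(paragraph_ids)
    return [[0.0 if paragraph_ids[i] == paragraph_ids[j] else NEG_INF
             for j in range(n)] for i in range(n)]


def dynamic_mask(scores, k):
    """Allow attention only to the k tokens with the highest scorer output."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    n = len(scores)
    return [[0.0 if j in top else NEG_INF for j in range(n)] for _ in range(n)]


def masked_attention(logits, *masks):
    """Softmax over each row after adding every mask to the attention logits."""
    weights = []
    for i, row in enumerate(logits):
        masked = [x + sum(m[i][j] for m in masks) for j, x in enumerate(row)]
        finite = [x for x in masked if x != NEG_INF]
        if not finite:
            # No allowed position for this token: emit all-zero weights.
            weights.append([0.0] * len(row))
            continue
        peak = max(finite)
        exps = [math.exp(x - peak) if x != NEG_INF else 0.0 for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


paragraph_ids = [0, 0, 1, 1]
scores = [0.9, 0.1, 0.8, 0.2]            # stand-in scorer outputs
logits = [[0.0] * 4 for _ in range(4)]   # stand-in scaled dot products
weights = masked_attention(logits, paragraph_mask(paragraph_ids), dynamic_mask(scores, 2))
print(weights)
```

With these toy inputs every token ends up attending only to the single selected key token of its own paragraph, which is the intended combination of the two motivations.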
3.2 Multi-level Cascaded Answer Predictor
Due to the nature of the NQ tasks, a short answer is always contained within a long answer, and thus it makes sense to use the prediction of long answers to facilitate the process of obtaining short answers. As shown on the right in Fig. 1, we design a cascaded structure to exploit this dependency. This predictor takes the token representation, the paragraph representation, and the question embedding as inputs to predict four outputs in a cascaded manner: (1) the long answer, (2) the start position of the short answer span, (3) the end position of the short answer span, and (4) the answer type. That is, the previous results are used for the next tasks.
Long Answer Prediction
We employ a dense layer with Tanh activation function as the long answer prediction layer, which takes the paragraph representation as input to obtain the long-answer prediction representation. Then the long-answer logits are computed with a linear layer
where is a trainable parameter.
Short Answer Prediction
Firstly, we use the long-answer prediction representation and the token representation as the inputs to predict the start position of the short answer. Then the prediction representation of the start position of the short answer will be re-used to predict the end position.
Since the row-dimension of is different from that of , we cannot directly concatenate the to . We tile the with along the row-dimension: . Note that indicates the index number of the paragraph where the -th token is located. Thus, the model can consider the prediction information of the long answer when predicting the short answer. Similarly, the start and end position logits of the short answer are predicted by,
where the two outputs are the logit vectors of the start positions and the end positions of the short answer, the two prediction layers are dense layers with Tanh activation function, and the remaining weights are trainable parameters.
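The tiling step that aligns paragraph-level rows with token-level rows can be sketched as below. The function names are hypothetical and the vectors are toy values; the dense layers that consume the concatenated rows are not shown.

```python
def tile_by_paragraph(paragraph_vectors, paragraph_ids):
    """Repeat each paragraph's vector once for every token inside it."""
    return [list(paragraph_vectors[pid]) for pid in paragraph_ids]


def concat_rows(left, right):
    """Concatenate two equally long lists of vectors row by row."""
    return [a + b for a, b in zip(left, right)]


long_answer_repr = [[0.1], [0.9]]                    # one row per paragraph
token_repr = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]   # one row per token
paragraph_ids = [0, 0, 1]                            # paragraph of each token

tiled = tile_by_paragraph(long_answer_repr, paragraph_ids)
start_inputs = concat_rows(token_repr, tiled)
print(start_inputs)  # [[5.0, 6.0, 0.1], [7.0, 8.0, 0.1], [9.0, 10.0, 0.9]]
```

After tiling, every token row carries the long-answer prediction information of its own paragraph, so the start-position layer can condition on it directly.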
Answer Type Prediction
Finally, the predictor outputs the answer type. There are five answer types as discussed in § 2. With the observation that humans can easily judge that some questions have no short answers even without seeing the document, we treat the question embedding as an auxiliary input for the answer type prediction. Besides, the token representation and the short-answer prediction representation are also used for that prediction:
where is the logits of the five answer types, is a dense layer with activation function, and is a trainable parameter.
Training Loss and Inference
For training, we compute cross-entropy loss over the above mentioned output logits, and jointly minimize these four cross-entropy losses as:
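A minimal sketch of the joint objective is given below, assuming the four cross-entropy terms are combined as an unweighted sum (the combination formula is not visible in this text, so that is an assumption). The argument names for the gold labels are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under a softmax."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def joint_loss(long_logits, start_logits, end_logits, type_logits,
               long_idx, start_idx, end_idx, type_idx):
    # Assumption: an unweighted sum of the four cross-entropy losses.
    return (cross_entropy(long_logits, long_idx)
            + cross_entropy(start_logits, start_idx)
            + cross_entropy(end_logits, end_idx)
            + cross_entropy(type_logits, type_idx))


loss = joint_loss([0.0, 0.0], [0.0] * 4, [0.0] * 4, [0.0] * 5, 1, 0, 3, 2)
print(loss)  # equals log(2) + log(4) + log(4) + log(5) for uniform logits
```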
During inference, we calculate the final long-answer score for all the paragraphs within the Wikipedia page based on the long-answer logits and the answer type logits . The long-answer score of paragraph can be written as
where denotes the logits where the answer type is “NULL”(no answer), denotes the sum of the logits where the answer type is not “NULL”. The answer type score can be seen as a bias of each document span in the Wikipedia page. Then we select the paragraph of the highest long-answer score over the entire Wikipedia page as the long answer.
Similarly, the short-answer score of the corresponding span is calculated by
where denotes the score where the answer type is “SHORT”(has short answer). We select the short answer span which has the highest short-answer score within the long answer as the final short answer. We use the official NQ evaluation script to set two separate thresholds for predicting whether the two types of answers are answerable.
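The long-answer selection over a whole page can be sketched as follows. The scoring rule here is an assumption, since the formula is not visible in this text: each paragraph's long-answer logit is shifted by a per-span bias equal to the sum of the non-NULL answer-type logits minus the NULL logit, with NULL taken to be index 0 following the order in which the five labels are listed. A paragraph that appears in several overlapping spans keeps its best score. All names are hypothetical.

```python
def span_bias(type_logits, null_index=0):
    """Assumed bias of one document span: non-NULL evidence minus NULL evidence."""
    non_null = sum(x for i, x in enumerate(type_logits) if i != null_index)
    return non_null - type_logits[null_index]


def best_long_answer(span_predictions):
    """Pick the highest-scoring paragraph over all spans of one page.

    span_predictions: list of (type_logits, {paragraph_index: long_logit}).
    """
    best = {}
    for type_logits, long_logits in span_predictions:
        bias = span_bias(type_logits)
        for paragraph, logit in long_logits.items():
            score = logit + bias
            if paragraph not in best or score > best[paragraph]:
                best[paragraph] = score
    if not best:
        return None, None
    winner = max(best, key=best.get)
    return winner, best[winner]


spans = [
    ([2.0, 0.5, 0.5, 0.0, 0.0], {0: 1.0, 1: 3.0}),   # span leaning towards NULL
    ([0.0, 1.0, 1.0, 0.0, 0.0], {1: 2.0, 2: 0.5}),   # span leaning towards an answer
]
print(best_long_answer(spans))  # (1, 4.0)
```

The returned score is what would then be compared against the threshold chosen with the official evaluation script to decide whether the question is answerable.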
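|Model||LA Dev P||LA Dev R||LA Dev F1||LA Test P||LA Test R||LA Test F1||SA Dev P||SA Dev R||SA Dev F1||SA Test P||SA Test R||SA Test F1|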
|DecAtt parikh2016decomposable + DocReader chen2017reading||52.7||57.0||54.8||54.3||55.7||55.0||34.3||28.9||31.4||31.9||31.1||31.5|
|BERTlarge + 4M synth NQ alberti2019synthetic||62.3||70.0||65.9||65.2||68.4||66.8||60.7||50.4||55.1||62.1||47.7||53.9|
|BERTjoint alberti2019bert + RoBERTa large liu2019roberta||65.6||69.1||67.3||-||-||-||60.9||51.0||55.5||-||-||-|
|BERTlarge + SQuAD2 PT + AoA pan2019frustratingly||-||-||68.2||-||-||-||-||-||57.2||-||-||-|
|BERTlarge + SSPT glass2019span||-||-||65.8||-||-||-||-||-||54.2||-||-||-|
|RikiNet-RoBERTa large (ensemble)||73.3||78.7||75.9||78.1||74.2||76.1||66.6||56.4||61.1||67.6||56.1||61.3|
|Single Human kwiatkowski2019natural||80.4||67.6||73.4||-||-||-||63.4||52.6||57.5||-||-||-|
We focus on the Natural Questions (NQ) kwiatkowski2019natural dataset in this work. The public release of the NQ dataset consists of 307,373 training examples and 7,830 examples for development data (dev set). NQ also provides a blind test set containing 7,842 examples, which can only be accessed through a public leaderboard submission.
4.2 Implementation Details
As discussed in § 2, we generate multiple document spans by splitting the Wikipedia page with a sliding window. Following pan2019frustratingly; alberti2019bert, the size and stride of the sliding window are set to 512 and 192 tokens respectively. The average number of document spans of one Wikipedia page is about 22. Since most of the document spans do not contain the answer, the numbers of negative samples (i.e., no answer) and positive samples (i.e., has answers) are extremely imbalanced. We follow pan2019frustratingly; alberti2019bert to sub-sample negative instances for training, where the rate of sub-sampling negative instances is the same as in pan2019frustratingly. As a result, there are 469,062 training instances in total.
We use Adam optimizer kingma2014adam with a batch size of
for model training. The initial learning rate, the learning rate warmup proportion, the training epoch, the hidden size, the number of blocks , and the hyperparameter are set to , , , , , and respectively. Our model takes approximately 24 hours to train with 4 Nvidia Tesla P40. Evaluation completed in about 6 hours on the NQ dev and test set with a single Nvidia Tesla P100.
We use the Google released BERT-large model fine-tuned with synthetic self-training alberti2019synthetic to encode the document and question as described in § 3.1.1. We also compare the performance of RikiNet which uses the pre-trained RoBERTa large model liu2019roberta. It should be noted that our RikiNet is orthogonal to the choice of a particular pre-trained language model.
4.3 Main Results
We present a comparison between previously published works on the NQ task and our RikiNet. We report the precision (P), the recall (R), and the F1 score for the long-answer (LA) and short-answer (SA) tasks on both the test set and the dev set in Tab. 1. The first two lines of Tab. 1 show the results of two multi-passage MRC baseline models presented in the original NQ paper kwiatkowski2019natural. The third to sixth lines show the results of the previous state-of-the-art models. These models all employ the BERTlarge model and perform better than the two baselines. Our RikiNet-BERTlarge also employs the BERTlarge model, and its single model achieves a significant improvement over the previously published best model on the test set (LA from 66.8 F1 to 74.3 F1, and SA from 53.9 F1 to 57.9 F1). To the best of our knowledge, this is the first single model that surpasses the single human performance kwiatkowski2019natural on both LA and SA tasks (the single RikiNet-BERTlarge model was submitted to the NQ public leaderboard on 7 Nov. 2019). We also provide a BERTjoint alberti2019bert + RoBERTa large liu2019roberta baseline on NQ, which only replaces the BERTlarge in the BERTjoint method with RoBERTa large. As expected, BERTjoint + RoBERTa large performs better than the original BERTjoint. Furthermore, our single model of RikiNet-RoBERTa large, which employs the RoBERTa large model, also achieves better performance on both LA and SA, significantly outperforming BERTjoint + RoBERTa large. These results demonstrate the effectiveness of our RikiNet.
Since most submissions on the NQ leaderboard are ensemble models, we also report the results of our ensemble model, which consists of three RikiNet-RoBERTa large models with different hyper-parameters. At the time of submission (29 Nov. 2019), the NQ leaderboard shows that our ensemble model achieves the best performance on both LA (F1 76.1) and SA (F1 61.3).
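As a quick arithmetic check on how the three reported metrics relate, F1 is the harmonic mean of precision and recall. The sketch below recomputes the ensemble's F1 values from the precision and recall figures in Tab. 1; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RikiNet-RoBERTa large (ensemble), dev set, values taken from Tab. 1.
print(round(f1_score(73.3, 78.7), 1))  # long answer: 75.9
print(round(f1_score(66.6, 56.4), 1))  # short answer: 61.1
```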
4.4 Ablation Study
RikiNet consists of two key parts: DPDA reader and multi-level cascaded answer predictor. To get a better insight into RikiNet, we conduct an in-depth ablation study on probing these two modules. We report the LA and SA F1 scores on the dev set.
Ablations of DPDA Reader
We keep the predictor and remove components of the DPDA reader. The results are shown in Tab. 2. In (a), we remove the entire DPDA reader as introduced in § 3.1 except BERTlarge. In (b), (c), and (d), we remove the dual-attention layer, question self-attention layer, and paragraph dynamic self-attention layer as described in § 3.1.2 respectively. In (e) and (f), we remove the paragraph attention mask of Eq. (3) and the dynamic attention mask of Eq. (4) respectively. We can see that after removing the DPDA reader, the performance drops sharply. In addition, the paragraph dynamic self-attention layer has the greatest impact on performance. Moreover, both the paragraph attention mask and the dynamic attention mask contribute to the performance improvement.
We also change the hyper-parameter and the number of blocks . Results show that the setting of performs better than (i.e., no dynamic attention mask), and performs best. For the number of DPDA blocks , the model achieves the best performance when .
|Setting||LA F1||SA F1|
|(a) - DPDA reader||70.7||55.9|
|(b) - Dual-attention layer||73.1||56.6|
|(c) - Question self-attention layer||73.5||57.5|
|(d) - Paragraph self-attention layer||72.2||56.3|
|(e) - Paragraph attention mask||73.2||57.1|
|(f) - Dynamic attention mask||72.9||56.8|
Ablations of Predictor
On the predictor side, we further remove or replace its components and report the results in Tab. 3. In (1) we remove the whole DPDA reader and predictor. In (2), we remove the multi-level prediction (i.e., training the model to predict long and short answers jointly) described in § 3.2, and follow the previous work alberti2019bert to directly predict the short answer and then select its paragraph as the long answer. We can see that our multi-level prediction is critical to the long answer prediction. In (3) we only remove the cascaded structure but keep the multi-level prediction, which means that the prediction representations are no longer used as input for other predictions; the performance of both long and short answers drops by about 1.0 F1. In (4) we change the ordering of the cascaded process. That is, instead of considering the long answer first and then the short answer as described in § 3.2, we consider the short answer first and then the long answer. However, we get slightly worse results in this way. In (5), we remove the question embedding which is used for answer type prediction. It can be observed that the question embedding contributes to performance improvement. In the variants of (6)-(9), we remove the dense prediction layers with Tanh activation function and replace them with Bi-directional Long Short-Term Memory (Bi-LSTM) hochreiter1997long; schuster1997bidirectional layers, transformer self-attention blocks, and dense prediction layers with Gaussian Error Linear Unit hendrycks2016gaussian activation function, but none of these variants performs better.
Overall, both proposed DPDA reader and multi-level cascaded answer predictor significantly improve the model performance.
|Setting||LA F1||SA F1|
|(1) - DPDA reader & Predictor||65.9||55.1|
|(2) - Multi-level prediction||70.9||57.1|
|(3) - Cascaded structure||73.0||56.7|
|(4) + S2L cascaded structure||73.6||57.5|
|(5) - Question embedding||73.4||57.4|
|(6) - Tanh dense prediction layer||73.2||57.3|
|(7) + Bi-LSTM prediction layer||73.3||57.4|
|(8) + Transformer prediction layer||73.5||57.5|
|(9) + GELU dense prediction layer||73.7||57.6|
5 Related Works
Natural Questions (NQ) dataset kwiatkowski2019natural has been recently proposed, where each question is paired with an entire Wikipedia page which is a long document containing multiple passages. Although BERT devlin2018bert based MRC models have surpassed human performance on several MRC benchmark datasets lan2019albert; devlin2018bert; liu2019roberta; rajpurkar2018know, a similar BERT method alberti2019bert still has a big gap with human performance on NQ dataset.
There are several recently proposed deep learning approaches for multi-passage reading comprehension. chen2017reading propose DrQA which contains a document retriever and a document reader (DocReader). clark2018simple introduce Document-QA which utilizes TF-IDF for paragraph selection and uses a shared normalization training objective. de2018question employ graph convolutional networks (GCNs) for this task. zhuang2019token design a gated token-level selection mechanism with a local convolution. In contrast, our RikiNet considers multi-level representations with a set of complementary attention mechanisms.
To solve the NQ task, kwiatkowski2019natural adapt Document-QA clark2018simple for NQ, and also utilize DecAtt parikh2016decomposable for paragraph selection and DocReader chen2017reading for answer prediction. BERTjoint alberti2019bert modifies BERT for NQ. Besides, some works focus on using data augmentation to improve the MRC models on NQ. alberti2019synthetic propose a synthetic QA corpora generation method based on roundtrip consistency. glass2019span propose a span selection method for BERT pre-training (SSPT). More recently, pan2019frustratingly introduce attention-over-attention cui2016attention into the BERT model. pan2019frustratingly also propose several techniques of data augmentation and model ensemble to further improve the model performance on NQ. Although the use of data augmentation and other advanced pre-trained language models lan2019albert may further improve model performance, this is not the main focus of this paper, and we leave them as our future work. Our RikiNet is a new MRC model tailored to the NQ challenges and can effectively represent the document and question at multiple levels to jointly predict the answers, which significantly outperforms the above methods.
We propose the RikiNet, which reads the Wikipedia pages to answer natural questions. The RikiNet consists of a dynamic paragraph dual-attention reader which learns the token-level, paragraph-level and question representations, and a multi-level cascaded answer predictor which jointly predicts the long and short answers in a cascaded manner. On the Natural Questions dataset, the RikiNet is the first single model that outperforms the single human performance. Furthermore, the RikiNet ensemble achieves the new state-of-the-art results at 76.1 F1 on long-answer and 61.3 F1 on short-answer tasks, which significantly outperforms all the other models on both criteria.
This work is supported by National Natural Science Fund for Distinguished Young Scholar (Grant No. 61625204) and partially supported by the Key Program of National Science Foundation of China (Grant No. 61836006).