RikiNet: Reading Wikipedia Pages for Natural Question Answering

by Dayiheng Liu, et al.

Reading long documents to answer open-domain questions remains challenging in natural language understanding. In this paper, we introduce a new model, called RikiNet, which reads Wikipedia pages for natural question answering. RikiNet contains a dynamic paragraph dual-attention reader and a multi-level cascaded answer predictor. The reader dynamically represents the document and question by utilizing a set of complementary attention mechanisms. The representations are then fed into the predictor to obtain the span of the short answer, the paragraph of the long answer, and the answer type in a cascaded manner. On the Natural Questions (NQ) dataset, a single RikiNet achieves 74.3 F1 and 57.9 F1 on the long-answer and short-answer tasks, respectively. To the best of our knowledge, it is the first single model that outperforms the single human performance. Furthermore, an ensemble RikiNet obtains 76.1 F1 and 61.3 F1 on the long-answer and short-answer tasks, respectively, achieving the best performance on the official NQ leaderboard.




1 Introduction

Machine reading comprehension (MRC) refers to the task of finding answers to given questions by reading and understanding some documents. It represents a challenging benchmark task in natural language understanding (NLU). With the progress of large-scale pre-trained language models devlin2018bert, state-of-the-art MRC models ju2019technical; yang2019xlnet; lan2019albert; zhang2019sg; liu2019roberta have already surpassed human-level performance on certain commonly used MRC benchmark datasets, such as SQuAD 1.1 rajpurkar2016squad, SQuAD 2.0 rajpurkar2018know, and CoQA reddy2019coqa.

Recently, a new benchmark MRC dataset called Natural Questions (NQ) kwiatkowski2019natural has presented a substantially greater challenge for existing MRC models (NQ provides some visual examples of the data at https://ai.google.com/research/NaturalQuestions/visualization). Specifically, there are two main challenges in NQ compared to previous MRC datasets like SQuAD 2.0. Firstly, instead of providing one relatively short paragraph for each question-answer (QA) pair, NQ gives an entire Wikipedia page, which is significantly longer than the documents in other datasets. Secondly, the NQ task not only requires the model to find an answer span (called the short answer) to the question like previous MRC tasks, but also asks the model to find a paragraph that contains the information required to answer the question (called the long answer).

In this paper, we focus on the NQ task and propose a new MRC model called RikiNet tailored to its associated challenges, which Reads the Wikipedia pages for natural question answering. For the first challenge of the NQ task mentioned above, RikiNet employs the proposed Dynamic Paragraph Dual-Attention (DPDA) reader, which contains multiple DPDA blocks. In each DPDA block, we iteratively perform dual-attention to represent documents and questions, and employ paragraph self-attention with a dynamic attention mask to fuse key tokens in each paragraph. The resulting context-aware question representation, question-aware token-level representation, and paragraph-level representation are fed into the predictor to obtain the answer. The motivations for designing the DPDA reader are: (a) although the entire Wikipedia page contains a large amount of text, one key observation is that most answers are only related to a few words in one paragraph; (b) the final paragraph representation can be used naturally for predicting long answers. We describe the details of the DPDA reader in § 3.1.

For the second challenge, unlike prior works on the NQ dataset alberti2019bert; pan2019frustratingly that only predict the short answer and directly select its paragraph as the long answer, RikiNet employs a multi-level cascaded answer predictor which jointly predicts the short answer span, the long answer paragraph, and the answer type in a cascaded manner. Another key intuition motivating our design is that even if the relevant documents are not given, humans can easily judge that some questions have no short answers benjamin2019. Take this question as a motivating example: “What is the origin of the Nobel prize?” The answer should be based on a long story, which cannot be easily expressed in a short span of entities. Therefore, we also feed the question representation into the predictor as an auxiliary prior for answer type prediction. The details will be given in § 3.2.

On the NQ test set, our single model obtains an F1 score of 74.3 on the long-answer task (LA) and 57.9 on the short-answer task (SA), compared to the published best single model alberti2019synthetic results of 66.8 F1 on LA and 53.9 F1 on SA. To the best of our knowledge, RikiNet is the first single model that outperforms the single human performance kwiatkowski2019natural on both LA and SA. In addition, our ensemble model obtains 76.1 F1 on LA and 61.3 F1 on SA, which is the best performance on both LA and SA on the official NQ leaderboard.

2 Preliminaries

Before we describe our model in detail, we first introduce the notations and problem formalization. Our paper considers the following NQ kwiatkowski2019natural task: Given a natural question $q$ and a related Wikipedia page $p$ (in the top 5 search results returned by the Google search engine), the model outputs a paragraph within the Wikipedia page $p$ as the long answer, which contains enough information to infer the answer to the question, and an entity span within the long answer that answers the question as the short answer. Also, for 1% of the Wikipedia pages, the short answer is “yes” or “no” instead of a short span. Both long answers and short answers can be NULL (i.e., no such answer could be found).

Given a natural question $q$ and its paired Wikipedia page $p$, we tokenize them with the 30,522 wordpiece vocabulary as used in devlin2018bert. Following alberti2019bert; pan2019frustratingly, we generate multiple document spans by splitting the Wikipedia page with a sliding window. Then, we obtain multiple 6-tuple training instances $(q, d, c, s, e, t)$ for each NQ data pair $(q, p)$, where $q$ and $d$ are the wordpiece IDs of the question with length $n$ and the document span with length $m$, $c \in S$ indicates the paragraph index of the long answer where $S$ is the set that includes all paragraph indexes (i.e., all long answer candidates) within $d$, $s$ and $e$ are inclusive indices pointing to the start and end of the short answer span, and $t$ represents the five answer types, corresponding to the labels “NULL” (no answer), “SHORT” (has short answer), “LONG” (only has long answer), “YES”, and “NO”.

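For each tuple $(q, d, c, s, e, t)$ of the data pair $(q, p)$, RikiNet takes the document span $d$ and the question $q$ as inputs, and jointly predicts the long answer paragraph $c$, the short answer span boundaries $s$ and $e$, and the answer type $t$. Finally, we merge the prediction results of every tuple to obtain the final predicted long answer, short answer, and their confidence scores of the data pair for evaluation.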

Figure 1: Overview of RikiNet framework.

3 Methodology

We propose the RikiNet which Reads the Wikipedia pages for natural question answering. As shown in Fig. 1, RikiNet consists of two modules: (a) the dynamic paragraph dual-attention reader as described in §3.1, and (b) the multi-level cascaded answer predictor as described in §3.2.

3.1 Dynamic Paragraph Dual-Attention Reader

The Dynamic Paragraph Dual-Attention (DPDA) reader aims to represent the document span and the question. It outputs the context-aware question representation, the question-aware token-level document representation, and the paragraph-level document representation, which will all be fed into the predictor to obtain the long and short answers.

3.1.1 Encoding Question and Document Span

We first employ a pre-trained language model such as BERT devlin2018bert to obtain the initial question representation $Q_0 \in \mathbb{R}^{n \times h}$ and the initial document span representation $D_0 \in \mathbb{R}^{m \times h}$, where $h$ is the hidden size. Similar to devlin2018bert, we concatenate a “[CLS]” token, the tokenized question with length $n$, a “[SEP]” token, the tokenized document span with length $m$, and a final “[SEP]” token. Then we feed the resulting sequence into the pre-trained language model.
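The input layout described above can be sketched as follows. This is a minimal illustration that works on already-tokenized string lists; the function name `build_input_tokens` and the toy tokens are assumptions for illustration, not part of the original implementation.

```python
from typing import List


def build_input_tokens(question: List[str], doc_span: List[str]) -> List[str]:
    """Lay out the sequence as [CLS] + question + [SEP] + document span + [SEP]."""
    return ["[CLS]"] + question + ["[SEP]"] + doc_span + ["[SEP]"]


# Toy example with a 3-token question and a 4-token document span.
tokens = build_input_tokens(["who", "wrote", "hamlet"], ["hamlet", "is", "a", "play"])
print(tokens)
```

With a question of length n and a document span of length m, the resulting sequence has n + m + 3 positions, the extra three being the special tokens.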

3.1.2 Dynamic Paragraph Dual-Attention Block

As shown on the left in Fig. 1, the DPDA reader contains multiple Dynamic Paragraph Dual-Attention (DPDA) blocks. The first block takes the initial question and document span representations $Q_0$ and $D_0$ as inputs. The question and document outputs $Q_{(t)}$ and $D_{(t)}$ of the $t$-th block are then fed into the next block. Each block contains three types of layers: the dual-attention layer, the paragraph dynamic self-attention layer, and the question self-attention layer. The last DPDA block outputs the final question and document representations. We describe them in detail now.

Dual-Attention Layer

To strengthen the information fusion from the question to the paragraphs as well as from the paragraphs to the question, we adapt a dual-attention mechanism, which has been shown effective in other MRC models xiong2017dcn+; seo2016bidirectional; xiong2016dynamic. We further tweak it by increasing the depth of attention followed by a residual connection he2016deep and layer normalization ba2016layer.

In particular, the $t$-th block first calculates a similarity matrix between the document tokens and the question tokens, which is then normalized row-wise and column-wise to produce two attention weights: one across the document for each token in the question, and one across the question for each token in the document.

Similar to xiong2016dynamic; seo2016bidirectional, we use these attention weights to obtain the question-aware representation of the document, where the attended representations are combined by concatenation. We also obtain the context-aware question representation in a dual way.

We finally apply the residual connection and layer normalization to both the question and the document representations after linear transformations, whose weights are trainable parameters in the dual-attention layer of the $t$-th block. The resulting document representation will be fed into the paragraph dynamic self-attention layer to obtain the paragraph representation. The resulting question representation will be fed into the question self-attention layer to get the question embedding.
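As a rough illustration of the two normalizations described in this layer, the sketch below builds a similarity matrix and normalizes it in both directions. It is only a sketch under stated assumptions: the dot-product similarity and all function names are assumptions, and the later concatenation, linear transformation, residual connection, and layer normalization steps are omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of finite scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def similarity(doc: Matrix, question: Matrix) -> Matrix:
    """Dot-product similarity (an assumption); shape is m x n."""
    return [[sum(d * q for d, q in zip(d_row, q_row)) for q_row in question]
            for d_row in doc]


def dual_attention_weights(doc: Matrix, question: Matrix) -> Tuple[Matrix, Matrix]:
    sim = similarity(doc, question)
    # Each document token distributes attention across the question (m x n).
    attn_over_question = [softmax(row) for row in sim]
    # Each question token distributes attention across the document (n x m).
    attn_over_document = [softmax(row) for row in transpose(sim)]
    return attn_over_question, attn_over_document


doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 tokens, h = 2
question = [[1.0, 0.0], [0.5, 0.5]]          # n = 2 tokens, h = 2
over_q, over_d = dual_attention_weights(doc, question)
print(over_q)
print(over_d)
```

Each row of both returned matrices sums to one, which is the only property this sketch is meant to show.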

Question Self-Attention Layer

This layer uses a transformer self-attention block vaswani2017attention to further enrich the question representation. The transformer block consists of two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward layer. Each sub-layer is placed inside a residual connection with layer normalization. After the last (i.e., the $T$-th) DPDA block, where $T$ denotes the number of DPDA blocks, we obtain the final question embedding by applying mean pooling over the question token representations. This question embedding will be further fed into the predictor for answer type prediction.

Paragraph Dynamic Self-Attention Layer

This layer is responsible for gathering information on the key tokens in each paragraph. The token-level representation is first computed by a transformer self-attention block applied to the document representation from the dual-attention layer (Eq. (1)). The difference from the original multi-head self-attention in vaswani2017attention is that we incorporate two extra attention masks, which will be introduced later in Eq. (3) and (4). The last DPDA block applies a mean pooling to the tokens within the same paragraph to obtain the paragraph representation as

$$\mathbf{L}[j] = \frac{1}{|\{i : \mathrm{LP}_i = j\}|} \sum_{i \,:\, \mathrm{LP}_i = j} \mathbf{D}_{(T)}[i], \tag{2}$$

where $\mathbf{L}$ has $l$ rows, $l$ denotes the number of paragraphs within the document span (i.e., the number of long answer candidates within the document span), $\mathbf{L}[j]$ is the representation of the $j$-th paragraph, $\mathbf{D}_{(T)}[i]$ is the representation of the $i$-th token at the last DPDA block, and $\mathrm{LP}_i$ indicates the index number of the paragraph where the $i$-th token is located.
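The paragraph-level mean pooling described above can be sketched in plain Python. The function name `paragraph_mean_pool` and the toy vectors are illustrative assumptions; paragraphs are assumed to be indexed from 0.

```python
from typing import List


def paragraph_mean_pool(token_reprs: List[List[float]],
                        paragraph_ids: List[int],
                        num_paragraphs: int) -> List[List[float]]:
    """Average the token vectors that share a paragraph index."""
    hidden = len(token_reprs[0])
    sums = [[0.0] * hidden for _ in range(num_paragraphs)]
    counts = [0] * num_paragraphs
    for vec, pid in zip(token_reprs, paragraph_ids):
        counts[pid] += 1
        for k, value in enumerate(vec):
            sums[pid][k] += value
    # Paragraphs with no tokens keep a zero vector.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


tokens = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
paragraph_of_token = [0, 0, 1]  # tokens 0 and 1 are in paragraph 0, token 2 in paragraph 1
print(paragraph_mean_pool(tokens, paragraph_of_token, num_paragraphs=2))
```

The output has one row per long answer candidate, so it can be scored directly by a paragraph-level prediction layer.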

Tokens in the original multi-head attention layer of the transformer self-attention block attend to all tokens. We introduce two attention masks to the self-attention sub-layer in Eq. (1) based on two key motivations: 1) each paragraph representation should focus on the question-aware token information inside the paragraph; 2) most of the answers are only related to a few words in a paragraph. For the first motivation, we introduce the paragraph attention mask, which is defined as:

$$\mathbf{M}^{L}[i, j] = \begin{cases} 0, & \text{if } \mathrm{LP}_i = \mathrm{LP}_j, \\ -\infty, & \text{otherwise,} \end{cases} \tag{3}$$

where $\mathrm{LP}_i$ indicates the index number of the paragraph where the $i$-th token is located. It forces each token to only attend to the tokens within the same paragraph. Therefore, each paragraph representation focuses on its internal token information after the mean pooling of Eq. (2).
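A small sketch may help make the effect of the paragraph attention mask concrete. It assumes an additive mask (0 for allowed pairs, negative infinity otherwise) applied to attention scores before the softmax; the function names and the toy scores are illustrative assumptions.

```python
import math
from typing import List

NEG_INF = float("-inf")


def paragraph_attention_mask(paragraph_ids: List[int]) -> List[List[float]]:
    """0 where two tokens share a paragraph, -inf everywhere else."""
    return [[0.0 if pi == pj else NEG_INF for pj in paragraph_ids]
            for pi in paragraph_ids]


def masked_softmax(scores: List[float], mask_row: List[float]) -> List[float]:
    """Softmax over scores + mask; masked positions get exactly zero weight."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    top = max(shifted)
    exps = [math.exp(x - top) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


paragraph_of_token = [0, 0, 1]
mask = paragraph_attention_mask(paragraph_of_token)
# Attention weights of token 0 over all three tokens, given raw scores.
weights = masked_softmax([1.0, 2.0, 3.0], mask[0])
print(weights)  # the third weight is 0.0 because token 2 is in another paragraph
```

Every token shares a paragraph with itself, so no row of this mask is fully masked and the softmax is always well defined.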

Based on the second motivation, we dynamically generate another attention mask to select key tokens before self-attention. We use a neural network called the scorer, followed by an activation function, to calculate the importance score for each token. Then we obtain the dynamic attention mask (Eq. (4)) by selecting the top-$K$ tokens according to their scores at the $t$-th block, where $K$ is a hyperparameter and the selected indices form the set of top-$K$ tokens. (Following zhuang2019token, our implementation pads the unselected token representations with zero embeddings and adds the linearly transformed scorer representation to the token representation to avoid gradient vanishing for scorer training.) This attention mask lets the paragraph representation concentrate on the selected key tokens.
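The top-K selection can be sketched as below. This is a hedged illustration rather than the original implementation: the scores are given directly instead of coming from a learned scorer, and the choice to allow attention only when both the attending token and the attended token are selected is an assumption.

```python
from typing import List, Set

NEG_INF = float("-inf")


def top_k_indices(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def dynamic_attention_mask(scores: List[float], k: int) -> List[List[float]]:
    """0 where both tokens are selected (assumption), -inf elsewhere."""
    selected = top_k_indices(scores, k)
    n = len(scores)
    return [[0.0 if (i in selected and j in selected) else NEG_INF
             for j in range(n)]
            for i in range(n)]


importance = [0.1, 0.9, 0.5, 0.7]  # hypothetical scorer outputs
print(sorted(top_k_indices(importance, k=2)))
mask = dynamic_attention_mask(importance, k=2)
# Rows of unselected tokens are fully masked, so their outputs need separate
# handling (for example zero padding) instead of a softmax over that row.
print(mask[1])
```

When K equals the number of tokens, every entry of the mask is 0 and the layer falls back to unmasked attention.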

In the final scaled dot-product attention of the multi-head self-attention sub-layer vaswani2017attention in Eq. (1), the two proposed attention masks are added to the attention scores before the softmax normalization.

3.2 Multi-level Cascaded Answer Predictor

Due to the nature of the NQ tasks, a short answer is always contained within a long answer, and thus it makes sense to use the prediction of long answers to facilitate the process of obtaining short answers. As shown on the right in Fig. 1, we design a cascaded structure to exploit this dependency. This predictor takes the token representation, the paragraph representation, and the question embedding as inputs to predict four outputs in a cascaded manner: (1) the long answer, (2) the start position of the short answer span, (3) the end position of the short answer span, and (4) the answer type. That is, the results of each prediction are used for the next task in this order.

Long Answer Prediction

We employ a dense layer with the Tanh activation function as the long answer prediction layer, which takes the paragraph representation as input to obtain the long-answer prediction representation. Then the long-answer logits are computed with a linear layer whose weight is a trainable parameter.

Short Answer Prediction

Firstly, we use the long-answer prediction representation and the token representation as the inputs to predict the start position of the short answer. Then the prediction representation of the start position of the short answer will be re-used to predict the end position.

Since the row dimension of the long-answer prediction representation (one row per paragraph) is different from that of the token representation (one row per token), we cannot directly concatenate them. We therefore tile the long-answer prediction representation along the row dimension, so that each token is paired with the representation of the paragraph in which it is located. Thus, the model can consider the prediction information of the long answer when predicting the short answer. Similarly to the long answer, the start and end position logits of the short answer are output as logit vectors over the start positions and the end positions. They are predicted by two dense layers with the Tanh activation function, each followed by a linear layer with trainable parameters.
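The tiling step can be sketched with plain lists. The function name `tile_and_concat` and the order of concatenation (paragraph vector first, token vector second) are assumptions made for illustration.

```python
from typing import List


def tile_and_concat(paragraph_reprs: List[List[float]],
                    token_reprs: List[List[float]],
                    paragraph_ids: List[int]) -> List[List[float]]:
    """Pair every token with the vector of the paragraph that contains it."""
    return [paragraph_reprs[pid] + token
            for token, pid in zip(token_reprs, paragraph_ids)]


long_answer_reprs = [[1.0], [2.0]]                      # one row per paragraph
token_reprs = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]  # one row per token
paragraph_of_token = [0, 1, 1]
print(tile_and_concat(long_answer_reprs, token_reprs, paragraph_of_token))
```

After tiling, both inputs have one row per token, so the concatenated matrix can be fed to a token-level prediction layer.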

Answer Type Prediction

Finally, the predictor outputs the answer type. There are five answer types as discussed in § 2. With the observation that humans can easily judge that some questions have no short answers even without seeing the document, we treat the question embedding as an auxiliary input for the answer type prediction. Besides, the token representation and the short-answer prediction representation are also used for that prediction. The logits of the five answer types are computed by a dense layer with the Tanh activation function followed by a linear layer with a trainable parameter.

Training Loss and Inference

For training, we compute the cross-entropy loss over each of the above-mentioned output logits (long answer, start position, end position, and answer type), and jointly minimize these four cross-entropy losses.
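A minimal sketch of such a joint objective is given below. It assumes an unweighted sum of the four cross-entropy terms, which is an assumption made for illustration; the function names and toy logits are hypothetical.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class under a softmax."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target]


def joint_loss(long_logits: List[float], start_logits: List[float],
               end_logits: List[float], type_logits: List[float],
               c: int, s: int, e: int, t: int) -> float:
    """Unweighted sum of the four losses (the equal weighting is an assumption)."""
    return (cross_entropy(long_logits, c)
            + cross_entropy(start_logits, s)
            + cross_entropy(end_logits, e)
            + cross_entropy(type_logits, t))


loss = joint_loss(
    long_logits=[2.0, 0.5],            # two long answer candidates
    start_logits=[0.1, 3.0, 0.2, 0.0],  # four token positions
    end_logits=[0.0, 0.3, 2.5, 0.1],
    type_logits=[0.0, 2.0, 0.5, 0.1, 0.1],  # five answer types
    c=0, s=1, e=2, t=1,
)
print(round(loss, 4))
```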

During inference, we calculate the final long-answer score for all the paragraphs within the Wikipedia page based on the long-answer logits and the answer type logits. The long-answer score of a paragraph combines its long-answer logit with an answer type score, which is computed from the logit where the answer type is “NULL” (no answer) and the sum of the logits where the answer type is not “NULL”. The answer type score can be seen as a bias of each document span in the Wikipedia page. Then we select the paragraph with the highest long-answer score over the entire Wikipedia page as the long answer.

Similarly, the short-answer score of the corresponding span is calculated from the start and end position logits together with the answer type score where the answer type is “SHORT” (has short answer). We select the short answer span which has the highest short-answer score within the long answer as the final short answer. We use the official NQ evaluation script to set two separate thresholds for predicting whether the two types of answers are answerable.

4 Experiments

| Model | LA Dev P | LA Dev R | LA Dev F1 | LA Test P | LA Test R | LA Test F1 | SA Dev P | SA Dev R | SA Dev F1 | SA Test P | SA Test R | SA Test F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DocumentQA clark2018simple | 47.5 | 44.7 | 46.1 | 48.9 | 43.3 | 45.7 | 38.6 | 33.2 | 35.7 | 40.6 | 31.0 | 35.1 |
| DecAtt parikh2016decomposable + DocReader chen2017reading | 52.7 | 57.0 | 54.8 | 54.3 | 55.7 | 55.0 | 34.3 | 28.9 | 31.4 | 31.9 | 31.1 | 31.5 |
| BERTjoint alberti2019bert | 61.3 | 68.4 | 64.7 | 64.1 | 68.3 | 66.2 | 59.5 | 47.3 | 52.7 | 63.8 | 44.0 | 52.1 |
| BERTlarge + 4M synth NQ alberti2019synthetic | 62.3 | 70.0 | 65.9 | 65.2 | 68.4 | 66.8 | 60.7 | 50.4 | 55.1 | 62.1 | 47.7 | 53.9 |
| BERTjoint alberti2019bert + RoBERTa large liu2019roberta ‡ | 65.6 | 69.1 | 67.3 | - | - | - | 60.9 | 51.0 | 55.5 | - | - | - |
| BERTlarge + SQuAD2 PT + AoA pan2019frustratingly † | - | - | 68.2 | - | - | - | - | - | 57.2 | - | - | - |
| BERTlarge + SSPT glass2019span † | - | - | 65.8 | - | - | - | - | - | 54.2 | - | - | - |
| RikiNet-BERTlarge | 73.2 | 74.5 | 73.9 | 74.2 | 74.4 | 74.3 | 61.1 | 54.7 | 57.7 | 63.5 | 53.2 | 57.9 |
| RikiNet-RoBERTa large ‡ | 74.3 | 76.4 | 75.3 | - | - | - | 61.4 | 57.3 | 59.3 | - | - | - |
| RikiNet-BERTlarge (ensemble) | 74.4 | 76.3 | 75.4 | 75.3 | 75.9 | 75.6 | 66.9 | 53.8 | 59.6 | 63.2 | 56.1 | 59.5 |
| RikiNet-RoBERTa large (ensemble) | 73.3 | 78.7 | 75.9 | 78.1 | 74.2 | 76.1 | 66.6 | 56.4 | 61.1 | 67.6 | 56.1 | 61.3 |
| Single Human kwiatkowski2019natural | 80.4 | 67.6 | 73.4 | - | - | - | 63.4 | 52.6 | 57.5 | - | - | - |
| Super-annotator kwiatkowski2019natural | 90.0 | 84.6 | 87.2 | - | - | - | 79.1 | 72.6 | 75.7 | - | - | - |

Table 1: Performance comparisons on the dev set and the blind test set of the NQ dataset. We report the evaluation results of the precision (P), the recall (R), and the F1 score for both long-answer (LA) and short-answer (SA) tasks. † refers to the works that only provide the F1 results on the dev set in their paper. ‡ refers to our implementations where we only report the results on the dev set, due to the NQ leaderboard submission rules (each participant is only allowed to submit once per week).
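As a quick sanity check on how the columns of Tab. 1 relate, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes a few F1 entries from the reported P and R values; the helper name `f1_score` is an illustrative assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs taken from Tab. 1.
print(round(f1_score(74.2, 74.4), 1))  # RikiNet-BERTlarge, LA test
print(round(f1_score(63.5, 53.2), 1))  # RikiNet-BERTlarge, SA test
print(round(f1_score(80.4, 67.6), 1))  # Single Human, LA dev
```

The recomputed values agree with the reported F1 entries of 74.3, 57.9, and 73.4 after rounding to one decimal place.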

4.1 Dataset

We focus on the Natural Questions (NQ) kwiatkowski2019natural dataset in this work. The public release of the NQ dataset consists of 307,373 training examples and 7,830 examples for development data (dev set). NQ also provides a blind test set containing 7,842 examples, which can only be accessed through a public leaderboard submission.

4.2 Implementation Details

As discussed in § 2, we generate multiple document spans by splitting the Wikipedia page with a sliding window. Following pan2019frustratingly; alberti2019bert, the size and stride of the sliding window are set to 512 and 192 tokens respectively. The average number of document spans of one Wikipedia page is about 22. Since most of the document spans do not contain the answer, the numbers of negative samples (i.e., no answer) and positive samples (i.e., has answers) are extremely imbalanced. We follow pan2019frustratingly; alberti2019bert to sub-sample negative instances for training, where the rate of sub-sampling negative instances is the same as in pan2019frustratingly. As a result, there are 469,062 training instances in total.
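The sliding-window split can be sketched as follows, using the window size of 512 and stride of 192 stated above. This is a simplified sketch: it slides over document tokens only and ignores how the question and special tokens share the window, which is an assumption; the function name `sliding_windows` is hypothetical.

```python
from typing import List, Sequence, Tuple


def sliding_windows(tokens: Sequence[int], size: int = 512,
                    stride: int = 192) -> List[Tuple[int, Sequence[int]]]:
    """Return (start offset, span) pairs that together cover the whole page."""
    spans = []
    start = 0
    while True:
        spans.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the page
        start += stride
    return spans


page = list(range(1000))  # stand-in for the wordpiece IDs of one Wikipedia page
for offset, span in sliding_windows(page):
    print(offset, len(span))
```

Because the stride is smaller than the window size, neighbouring spans overlap, so a paragraph that straddles a window boundary is usually seen whole in at least one span.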

We use the Adam optimizer kingma2014adam with a batch size of for model training. The initial learning rate, the learning rate warmup proportion, the training epoch, the hidden size , the number of blocks , and the hyperparameter are set to , , , , , and respectively. Our model takes approximately 24 hours to train with four Nvidia Tesla P40 GPUs. Evaluation takes about 6 hours on the NQ dev and test sets with a single Nvidia Tesla P100 GPU.

We use the Google released BERT-large model fine-tuned with synthetic self-training alberti2019synthetic to encode the document and question as described in § 3.1.1. We also compare the performance of RikiNet which uses the pre-trained RoBERTa large model liu2019roberta. It should be noted that our RikiNet is orthogonal to the choice of a particular pre-trained language model.

4.3 Main Results

We present a comparison between previously published works on the NQ task and our RikiNet. We report the results of the precision (P), the recall (R), and the F1 score for the long-answer (LA) and short-answer (SA) tasks on both the test set and the dev set in Tab. 1. The first two lines of Tab. 1 show the results of two multi-passage MRC baseline models presented in the original NQ paper kwiatkowski2019natural. The third to sixth lines show the results of the previous state-of-the-art models. These models all employ the BERTlarge model and perform better than those two baselines. Our RikiNet-BERTlarge also employs the BERTlarge model, and its single model achieves a significant improvement over the previously published best model on the test set (LA from 66.8 F1 to 74.3 F1, and SA from 53.9 F1 to 57.9 F1). To the best of our knowledge, this is the first single model that surpasses the single human performance kwiatkowski2019natural on both LA and SA tasks (the single RikiNet-BERTlarge model was submitted to the NQ public leaderboard on 7 Nov. 2019). We also provide a BERTjoint alberti2019bert + RoBERTa large liu2019roberta baseline on NQ, which only replaces the BERTlarge in the BERTjoint method with RoBERTa large. As expected, BERTjoint + RoBERTa large performs better than the original BERTjoint. Furthermore, our single RikiNet-RoBERTa large model, which employs the RoBERTa large model, also achieves better performance on both LA and SA, significantly outperforming BERTjoint + RoBERTa large. These results demonstrate the effectiveness of our RikiNet.

Since most submissions on the NQ leaderboard are ensemble models, we also report the results of our ensemble model, which consists of three RikiNet-RoBERTa large models with different hyper-parameters. At the time of submission (29 Nov. 2019), the NQ leaderboard shows that our ensemble model achieves the best performance on both LA (F1 76.1) and SA (F1 61.3).

4.4 Ablation Study

RikiNet consists of two key parts: DPDA reader and multi-level cascaded answer predictor. To get a better insight into RikiNet, we conduct an in-depth ablation study on probing these two modules. We report the LA and SA F1 scores on the dev set.

Ablations of DPDA Reader

We keep the predictor and remove components of the DPDA reader. The results are shown in Tab. 2. In (a), we remove the entire DPDA reader as introduced in § 3.1 except BERTlarge. In (b), (c), and (d), we remove the dual-attention layer, the question self-attention layer, and the paragraph dynamic self-attention layer as described in § 3.1.2 respectively. In (e) and (f), we remove the paragraph attention mask of Eq. (3) and the dynamic attention mask of Eq. (4) respectively. We can see that after removing the DPDA reader, the performance drops sharply. In addition, among the three layers, the paragraph dynamic self-attention layer has the greatest impact on performance. Moreover, both the paragraph attention mask and the dynamic attention mask contribute to the performance improvement.

We also change the hyper-parameter and the number of blocks . Results show that the setting of performs better than (i.e., no dynamic attention mask), and performs best. For the number of DPDA blocks , the model achieves the best performance when .

Setting LA F1 SA F1
RikiNet-BERTlarge (Full) 73.9 57.7
(a) - DPDA reader 70.7 55.9
(b) - Dual-attention layer 73.1 56.6
(c) - Question self-attention layer 73.5 57.5
(d) - Paragraph self-attention layer 72.2 56.3
(e) - Paragraph attention mask 73.2 57.1
(f) - Dynamic attention mask 72.9 56.8
RikiNet-BERTlarge () 72.9 56.8
RikiNet-BERTlarge () 73.7 57.3
RikiNet-BERTlarge () 73.9 57.7
RikiNet-BERTlarge () 73.7 56.9
RikiNet-BERTlarge () 70.7 55.9
RikiNet-BERTlarge () 73.6 57.6
RikiNet-BERTlarge () 73.9 57.7
RikiNet-BERTlarge () 73.5 57.1
RikiNet-BERTlarge () 73.0 56.9

Table 2: Ablations of DPDA reader on dev set of NQ dataset.
Ablations of Predictor

On the predictor side, we further remove or replace its components and report the results in Tab. 3. In (1), we remove the whole DPDA reader and predictor. In (2), we remove the multi-level prediction (i.e., training the model to predict long and short answers jointly) described in § 3.2, and follow the previous work alberti2019bert to directly predict the short answer and then select its paragraph as the long answer. We can see that our multi-level prediction is critical to the long answer prediction. In (3), we only remove the cascaded structure but keep the multi-level prediction, which means that the prediction representations are no longer used as input for other predictions. In this case, the performance of both long and short answers drops by about 1.0 F1. In (4), we change the ordering of the cascaded process. That is, instead of considering the long answer first and then the short answer as described in § 3.2, we consider the cascaded structure of the short answer first and then the long answer. However, we get slightly worse results in this way. In (5), we remove the question embedding which is used for answer type prediction. It can be observed that the question embedding contributes to the performance improvement. In the variants (6)-(9), we remove the dense prediction layers with the Tanh activation function and replace them with Bi-directional Long Short-Term Memory (Bi-LSTM) hochreiter1997long; schuster1997bidirectional layers, transformer self-attention blocks, and dense prediction layers with the Gaussian Error Linear Unit (GELU) hendrycks2016gaussian activation function, but none of these variants performs better.

Overall, both proposed DPDA reader and multi-level cascaded answer predictor significantly improve the model performance.

Setting LA F1 SA F1
RikiNet-BERTlarge (Full) 73.9 57.7
(1) - DPDA reader & Predictor 65.9 55.1
(2) - Multi-level prediction 70.9 57.1
(3) - Cascaded structure 73.0 56.7
(4)     + S2L cascaded structure 73.6 57.5
(5) - Question embedding 73.4 57.4
(6) - Tanh dense prediction layer 73.2 57.3
(7)     + Bi-LSTM prediction layer 73.3 57.4
(8)     + Transformer prediction layer 73.5 57.5
(9)     + GELU dense prediction layer 73.7 57.6
Table 3: Ablations of multi-level cascaded predictor on dev set of NQ dataset.

5 Related Works

Natural Questions (NQ) dataset kwiatkowski2019natural has been recently proposed, where each question is paired with an entire Wikipedia page which is a long document containing multiple passages. Although BERT devlin2018bert based MRC models have surpassed human performance on several MRC benchmark datasets lan2019albert; devlin2018bert; liu2019roberta; rajpurkar2018know, a similar BERT method alberti2019bert still has a big gap with human performance on NQ dataset.

There are several recently proposed deep learning approaches for multi-passage reading comprehension. chen2017reading propose DrQA which contains a document retriever and a document reader (DocReader). clark2018simple introduce Document-QA which utilizes TF-IDF for paragraph selection and uses a shared normalization training objective. de2018question employ graph convolutional networks (GCNs) for this task. zhuang2019token design a gated token-level selection mechanism with a local convolution. In contrast, our RikiNet considers multi-level representations with a set of complementary attention mechanisms.

To solve the NQ task, kwiatkowski2019natural adapt Document-QA clark2018simple for NQ, and also utilize DecAtt parikh2016decomposable for paragraph selection and DocReader chen2017reading for answer prediction. BERTjoint alberti2019bert modifies BERT for NQ. Besides, some works focus on using data augmentation to improve the MRC models on NQ. alberti2019synthetic propose a synthetic QA corpora generation method based on roundtrip consistency. glass2019span propose a span selection method for BERT pre-training (SSPT). More recently, pan2019frustratingly introduce attention-over-attention cui2016attention into the BERT model. pan2019frustratingly also propose several techniques of data augmentation and model ensemble to further improve the model performance on NQ. Although the use of data augmentation and other advanced pre-trained language models lan2019albert may further improve model performance, this is not the main focus of this paper, and we leave it as future work. Our RikiNet is a new MRC model tailored to the NQ challenges. It can effectively represent the document and question at multiple levels to jointly predict the answers, and it significantly outperforms the above methods.

6 Conclusion

We propose RikiNet, which reads Wikipedia pages to answer natural questions. RikiNet consists of a dynamic paragraph dual-attention reader, which learns the token-level, paragraph-level, and question representations, and a multi-level cascaded answer predictor, which jointly predicts the long and short answers in a cascaded manner. On the Natural Questions dataset, RikiNet is the first single model that outperforms the single human performance. Furthermore, the RikiNet ensemble achieves new state-of-the-art results of 76.1 F1 on the long-answer task and 61.3 F1 on the short-answer task, significantly outperforming all the other models on both criteria.


This work is supported by National Natural Science Fund for Distinguished Young Scholar (Grant No. 61625204) and partially supported by the Key Program of National Science Foundation of China (Grant No. 61836006).