Relation-Guided Pre-Training for Open-Domain Question Answering

09/21/2021 ∙ by Ziniu Hu, et al.

Answering complex open-domain questions requires understanding the latent relations between the entities involved. However, we found that existing QA datasets are extremely imbalanced across relation types, which hurts generalization to questions with long-tail relations. To remedy this problem, we propose a Relation-Guided Pre-Training (RGPT-QA) framework. We first generate a relational QA dataset covering a wide range of relations from both Wikidata triplets and Wikipedia hyperlinks. We then pre-train a QA model to infer the latent relation from the question and to conduct extractive QA to obtain the target answer entity. We demonstrate that by pre-training with the proposed RGPT-QA technique, the popular open-domain QA model Dense Passage Retriever (DPR) achieves 2.2, 2.4, and 6.3 points of absolute improvement in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions, respectively. In particular, we show that RGPT-QA improves significantly on questions with long-tail relations.




1 Introduction

Open-domain question answering is a challenging task that answers factoid questions based on evidence in a large corpus (e.g., Wikipedia). Most open-domain QA systems follow the retriever-reader pipeline DBLP:conf/acl/ChenFWB17, in which a retriever selects a subset of candidate entities and associated passages from the corpus that might contain the answer, and a reader then extracts a text span from the passages as the answer. This process involves multiple entities that are relevant to answering the question. The QA system must extract these entities from the question and passages and identify the (latent) semantic relations between them in order to answer the question. For example, to answer the question “Where did Steph Curry play college basketball at?”, the QA model must reason over the implicit relation triplet ⟨Steph Curry, Educated At, Davidson College⟩ to identify the correct answer.

To capture the relational knowledge required to answer questions, most QA systems rely on human-annotated supervised QA datasets. However, it is expensive and tedious to annotate a large set of QA pairs that covers enough relational facts to train a strong QA model. In addition, we show that even for a large QA dataset like Natural Questions DBLP:journals/tacl/KwiatkowskiPRCP19, the training set covers only 16.4% of the relations in the WikiData DBLP:journals/cacm/VrandecicK14 knowledge graph. Moreover, for the covered relations, the frequency distribution is imbalanced, i.e., 30% of relation types appear only once. Consequently, for questions involving relations that are infrequent (a.k.a. long-tail) in the training set, the QA exact match accuracy is 22.4 percentage points lower than average. Such a biased relation distribution in existing QA datasets severely hurts the generalization of trained QA systems.

To improve open-domain QA systems on questions with long-tail relations, in this paper we propose RGPT-QA, a simple yet effective Relation-Guided Pre-training framework for training QA models with augmented relational facts from a knowledge graph. The framework consists of two steps: 1) generate a relational QA dataset that covers a wide range of relations without human labeling; 2) pre-train a QA model to predict latent relations from questions and conduct extractive QA.

Figure 1: Cumulative distribution function (CDF) of relation frequency in the Natural Questions training set.
Figure 2: Exact Match accuracy of a trained DPR model on the validation set, grouped by relation frequency in the training set.

The key to our framework is generating a relational QA dataset that aligns entities in Wikipedia passages with a structured knowledge graph (e.g., WikiData). We call this dataset the Grounded Relational Wiki-Graph. In this graph, each edge indicates the relationship between two connected entities, and the edge is linked to a passage in Wikipedia describing this relationship. As the WikiData knowledge graph also suffers from low coverage of long-tail entities and relations, we further convert hyperlinks in Wikipedia into knowledge triplets without specifying relation labels. Next, we link each relation triplet to a Wikipedia passage to help generate natural questions. We assume that if a passage in the Wiki-page of the source entity contains the target entity, then the context in this passage describes the relationship between the two entities. With the constructed graph, we use a template to synthesize question-answer pairs and then pre-train the QA model to capture the relational facts needed for answering complex open-domain questions.

As a pre-training method, RGPT-QA can be combined with any open-domain QA system. In this paper, we use the recently developed Dense Passage Retriever (DPR) DBLP:journals/corr/abs-2004-04906 as the base QA system to evaluate the effectiveness of the proposed pre-training. Experimental results show that RGPT-QA improves DPR’s Exact Match accuracy by 2.2, 2.4, and 6.3 points on Natural Questions, TriviaQA and WebQuestions respectively. Compared with the existing QA pre-training methods DBLP:conf/acl/LeeCT19; DBLP:journals/corr/abs-2002-08909; DBLP:conf/acl/LewisDR19, RGPT-QA explicitly captures a wide range of relational facts and thus achieves better performance. Moreover, for the questions containing long-tail relations in Natural Questions, the performance improves by 10.9 points, showing that RGPT-QA alleviates the unbalanced relation distribution problem in existing QA datasets.

The key contributions of this paper are:

  • We propose RGPT-QA, a pre-training method that injects knowledge from relational facts in a knowledge graph into QA models.

  • RGPT-QA enhances the performance of a popular QA model, i.e., DPR, especially on questions with long-tail relations.

2 Preliminary and Empirical Analysis

In this section, we first introduce the retriever-reader pipeline for open-domain QA, and then analyze how the relation distribution in existing QA datasets influences generalization performance.

Open-Domain Question Answering.

We focus on open-domain question answering, which requires extracting an answer from a large corpus (e.g., Wikipedia) of passages. Most open-domain QA systems follow the retriever-reader pipeline proposed by DBLP:conf/acl/ChenFWB17. Given a factoid question $q$, the QA system first retrieves relevant passages from the corpus $\mathcal{C}$. Then a reading comprehension module extracts a text span $w_s, \ldots, w_e$ from one of the retrieved passages as the answer to the question. Some QA datasets annotate the passage from which the answer is derived. We call this passage the ground-truth passage.

For the retriever, earlier systems utilize term-based retrieval methods, such as TF-IDF and BM25, which fail to capture the semantic relationship between question and passage beyond lexical matching. Recent studies DBLP:conf/acl/LeeCT19; DBLP:journals/corr/abs-2004-04906; DBLP:conf/iclr/DhingraZBNSC20 use BERT-like pre-trained language models to encode the question and passages independently into dense representations, and use maximum inner product search (MIPS) algorithms DBLP:conf/nips/Shrivastava014 to efficiently retrieve the most similar passages for each question. In this paper, we utilize Dense Passage Retriever (DPR) DBLP:journals/corr/abs-2004-04906 as the base QA model.

Relation Bias of Existing QA Datasets.

We first explore how much relational knowledge between entities is required to answer the questions in the existing open-domain QA dataset. We conduct an empirical study to analyze the relation distribution in Natural Questions, one of the largest open-domain QA datasets, and how it influences QA model’s performance.

For each question in the Natural Questions training set, we first select the entity that the ground-truth passage is associated with. We then combine this entity with the answer as an entity pair, and check whether WikiData contains a relation triplet describing the relation between the two entities. Out of 58,880 training QA pairs, 23,499 pairs could be aligned. The aligned QA pairs cover 329 relations, which account for 16.4% of the 2,008 relations in WikiData. For most unaligned QA pairs, the answers are not entities and thus cannot be aligned to the graph.

In addition to the low relation coverage in Natural Questions, we also find that the relation distribution is imbalanced. As shown in Figure 1, 90% of relations have a frequency of less than 41, and 30% of relations appear only once. In contrast, the most frequent relation “P161 (cast member)” appears 1,915 times out of 9,238 aligned QA pairs. A complete list of these relations with aligned QA pairs is shown in Tables 6-9 in the Appendix.
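
The coverage and imbalance statistics above can be computed with a few lines of code. The following is a minimal sketch, assuming the aligned QA pairs are available as a flat list of relation IDs; the function names and the toy input are illustrative and do not come from the paper.

```python
from collections import Counter


def relation_stats(aligned_relations, num_kg_relations):
    """Summarise the relation distribution over aligned QA pairs.

    aligned_relations: one relation ID per aligned QA pair (hypothetical input).
    num_kg_relations:  total number of relation types in the knowledge graph.
    """
    freq = Counter(aligned_relations)
    n_types = len(freq)
    singletons = sum(1 for count in freq.values() if count == 1)
    return {
        "covered_relations": n_types,
        "coverage": n_types / num_kg_relations,
        "singleton_share": singletons / n_types,
        "most_common": freq.most_common(1)[0],
    }


def frequency_cdf(freq_counts):
    """Return (frequency, share of relation types with frequency <= it) pairs."""
    counts = sorted(freq_counts)
    n = len(counts)
    cdf = []
    for i, count in enumerate(counts, start=1):
        if cdf and cdf[-1][0] == count:
            cdf[-1] = (count, i / n)  # extend the step for tied frequencies
        else:
            cdf.append((count, i / n))
    return cdf


# Toy data only: three relation types, one of which appears once.
toy = ["P161"] * 5 + ["P69"] * 2 + ["P19"]
stats = relation_stats(toy, num_kg_relations=10)
```

With the numbers reported in the text, the same coverage computation gives 329 / 2,008 ≈ 16.4%.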

We then study whether the imbalanced relation distribution influences the performance of QA models trained on these datasets. We use a DPR model trained on the training set of Natural Questions and calculate the Exact Match accuracy on each aligned QA pair in the validation set. We then analyze the correlation of the accuracy with the relation frequency in the training set. As illustrated in Figure 2, the validation accuracy generally increases with the relation frequency in the training set. For relations with frequency less than 5, the average accuracy is only 20.3%, much lower than the average accuracy of 42.7% over all samples in the validation set. This shows that the relation bias in existing QA datasets severely limits the generalization of QA models to questions with long-tail relations.

3 Method

In this section, we describe the RGPT-QA framework in two parts: 1) how to generate a relational QA dataset for pre-training; and 2) how to construct a self-training task that enables the QA model to capture relational facts.

3.1 Construct QA Pre-Training Dataset

To help the QA model capture the knowledge from relational facts required to answer open-domain questions, we first focus on generating a QA pre-training dataset in which the source entity in each question is connected to the target answer by a relation. Specifically, each QA datapoint consists of three components: 1) a relational triplet $\langle s, r, t \rangle$, in which $r$ denotes the relation between source entity $s$ and target entity $t$; 2) a question $q$ in natural language asking which entity has relation $r$ to source entity $s$, with target entity $t$ as the correct answer; 3) a positive context passage $p^+$, a passage from the source entity’s Wiki-page that contains the target answer $t$.

Grounded Relational Wiki-Graph.

To generate the QA pre-training dataset, leveraging the relation triplets in a knowledge graph, e.g., WikiData, is a natural choice for defining questions that require relation reasoning. We therefore construct the Grounded Relational Wiki-Graph, in which each relation triplet $\langle s, r, t \rangle$ is linked to a set of description passages $\text{desc}(s, t)$ in the Wiki-page of entity $s$. These descriptions are later used to generate questions $q$ and positive context passages $p^+$.

To construct such a graph, we use the January 2021 English dumps of Wikidata and Wikipedia. For each Wikipedia hyperlink $\langle s, ?, t \rangle$ ($?$ denotes that the relation is unlabeled), the passage containing the anchor text to $t$ in the Wiki-page of $s$ naturally fits our requirement for $\text{desc}(s, t)$. For each WikiData relation triplet $\langle s, r, t \rangle$, if the two entities are linked by a hyperlink in Wikipedia, we label the relation of the aligned hyperlink as $r$. For the other triplets without an aligned hyperlink, we extract all mentions of target entity $t$ from the Wiki-page of $s$, and use the context passage as $\text{desc}(s, t)$. The dataset statistics are shown in Table 1.
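
The alignment procedure described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: entities are represented by their surface names, mention detection is plain substring matching, only one relation is kept per entity pair, and the function name and data layout are assumptions.

```python
def build_grounded_graph(kg_triplets, hyperlinks, pages):
    """Link each (source, target) edge to passages in the source page.

    kg_triplets: iterable of (s, r, t) labelled triplets.
    hyperlinks:  dict (s, t) -> list of passages in s's page that anchor t.
    pages:       dict s -> list of passages (plain strings) of s's page.
    """
    graph = {}
    # 1) Every hyperlink starts as an unlabelled edge (relation None).
    for (s, t), passages in hyperlinks.items():
        graph[(s, t)] = {"relation": None, "desc": list(passages)}
    # 2) A labelled triplet either labels an aligned hyperlink ...
    for s, r, t in kg_triplets:
        if (s, t) in graph:
            graph[(s, t)]["relation"] = r
        else:
            # ... or falls back to plain mentions of t in s's page.
            mentions = [p for p in pages.get(s, []) if t in p]
            if mentions:
                graph[(s, t)] = {"relation": r, "desc": mentions}
    return graph


# Toy data only.
pages = {
    "Steph Curry": [
        "Curry played college basketball for Davidson College.",
        "He was born in Akron.",
    ]
}
hyperlinks = {("Steph Curry", "Davidson College"): [pages["Steph Curry"][0]]}
triplets = [
    ("Steph Curry", "educated at", "Davidson College"),
    ("Steph Curry", "place of birth", "Akron"),
]
graph = build_grounded_graph(triplets, hyperlinks, pages)
```

In this toy run the first triplet labels an existing hyperlink edge, while the second has no hyperlink and is grounded through a mention in the source page.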

| Statistic | Value |
|---|---|
| # of linked entities | 5,640,366 |
| # of relation labels | 2,008 |
| # of labelled triplets | 14,463,728 |
| # of unlabeled triplets (hyperlinks) | 66,796,110 |
| # of grounded descriptions per triplet | 1.25 |

Table 1: Statistics of Grounded Relational Wiki-Graph.
Figure 3: Example of a generated relational QA pair from Grounded Relational Wiki-Graph.
Relational QA Pair Generation

In the following, we introduce the details to generate the relational QA pair from the constructed graph.

Recent unsupervised QA studies DBLP:conf/acl/LiWDWX20; DBLP:journals/corr/abs-2010-12623 revealed that if the question and context passage share a large lexical overlap, the QA model can use low-level lexical patterns as shortcuts to find the answer. These shortcuts hinder the model from learning to comprehend the passages and answer the questions, hurting the model’s generalizability. To avoid this lexical overlap issue, we aim to generate questions from a passage that is different from the context passage $p^+$.

We first select all the entity pairs $\langle s, t \rangle$ that have mutual links in the Grounded Relational Wiki-Graph, with descriptions $\text{desc}(s, t)$ and $\text{desc}(t, s)$ found in the Wiki-pages of $s$ and $t$ respectively, each describing the relationship between the two entities. Without loss of generality, we denote $s$ as the source entity and $t$ as the target answer. The passage $\text{desc}(s, t)$, which contains the target answer $t$, can be used as the positive passage $p^+$.

Next, we generate a question $q$ that is lexically different from $p^+$ using a template that combines a relation mask token, MASK, with the description $\text{desc}(t, s)$ taken from the Wiki-page of $t$. As $\text{desc}(t, s)$ contains the source entity $s$, it provides information describing the relationship between $s$ and $t$, based on which the QA model should learn to infer the latent relation $r$, retrieve the positive passage $p^+$, and extract the answer entity $t$. In addition, as $q$ and $p^+$ come from different Wiki-pages, our question generation procedure avoids the lexical overlap issue that often occurs in prior unsupervised QA methods.

Mask Target Answer.

As the description used to build the question comes from the target answer $t$’s Wiki-page, it often contains the name of entity $t$. We thus need to mask $t$ from the question. Otherwise, the pre-trained model could simply identify the answer to a question based on local patterns.

As an example, Figure 3 shows how to generate a question for an example triplet. We first retrieve two descriptive passages, one from each entity’s Wiki-page. Using the template, we generate the question along with the ground-truth passage. We then mask out the target entity in the question and the source entity in the true passage (discussed later under retrieval pre-training) to avoid shortcuts. A list of generated relational QA pairs is shown in Table 10 in the Appendix.
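
To make the generation and masking steps concrete, here is a minimal sketch. The template string, the `[ENT]` placeholder used for masked entity names, and the function names are all hypothetical; the text only specifies that the question is built from the target-side description plus a relation mask token, and that the target is masked in the question and the source in the passage.

```python
REL_MASK = "[MASK]"  # relation mask token mentioned in the text
ENT_MASK = "[ENT]"   # hypothetical placeholder for a masked entity name


def mask_entity(text, entity_name):
    """Replace every surface occurrence of entity_name with a placeholder."""
    return text.replace(entity_name, ENT_MASK)


def make_qa_pair(s, t, desc_s_t, desc_t_s, relation=None):
    """Build one pre-training example for source s and target answer t.

    desc_s_t: passage from s's page mentioning t (becomes the context).
    desc_t_s: passage from t's page mentioning s (becomes the question body).
    The template below is an assumption made for illustration only.
    """
    question = f"{REL_MASK} of {s}: {mask_entity(desc_t_s, t)}"
    passage = mask_entity(desc_s_t, s)
    return {"question": question, "passage": passage, "answer": t, "relation": relation}


# Toy descriptions written for this example.
pair = make_qa_pair(
    s="Steph Curry",
    t="Davidson College",
    desc_s_t="Steph Curry played college basketball for Davidson College.",
    desc_t_s="Davidson College alumni include NBA player Steph Curry.",
    relation="educated at",
)
```

The resulting question no longer contains the answer name and the passage no longer contains the source name, so neither can be matched by surface form alone.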

3.2 Relation-Guided QA Pre-Training

With the generated relational QA dataset, we introduce how to pre-train both retriever and reader components in the QA model.

3.2.1 Relation Prediction Pre-Training

Our generated QA dataset contains the relation label $r$ between the source entity $s$ and the target answer $t$. Therefore, we design a self-training task to guide the model to predict the latent relation in the question, which can benefit both the retriever and the reader. Specifically, we adopt a linear projection layer over the token embedding to predict the relation over the WikiData relation set. The pre-training loss of relation prediction is a classification loss over this relation set, computed on the triplets with a labeled relation.

Self-Distillation for Unlabelled Relation

The hyperlinks in Wikipedia also provide valuable implicit information about the relations between entities. To leverage them, at each epoch we use the trained relation predictor with fixed parameters $\hat{\theta}$ as a teacher model to assign soft labels, and then progressively train the relation predictor as a student model on the assigned labels in the next epoch. This approach is referred to as self-distillation in the literature DBLP:conf/cvpr/XieLHL20; DBLP:conf/nips/ChenKSNH20. We minimize the self-distillation loss:

$$\mathcal{L}_{distill} = -\sum_{r'} \text{sg}\big(P_{\hat{\theta}}(r' \mid q)\big) \log P_{\theta}(r' \mid q)$$

where sg denotes the stop-gradient operation, which avoids back-propagation to the teacher network with fixed parameters $\hat{\theta}$, and $r'$ enumerates all the relation labels.

As the relation predictor cannot give reasonable predictions at early stages, we apply a dynamic schedule to the self-distillation loss $\mathcal{L}_{distill}$ through a time-dependent weighting term $w(\tau)$, which ramps up from zero to one. Combining the weighted self-distillation loss with the supervised relation loss $\mathcal{L}_{sup}$, we get the final relation loss $\mathcal{L}_{rel} = \mathcal{L}_{sup} + w(\tau)\,\mathcal{L}_{distill}$, which trains the model to capture all relational facts covered in the Grounded Relational Wiki-Graph.
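
The relation objective can be illustrated with a small, framework-free sketch. It assumes a softmax cross-entropy for the supervised term, a soft-label cross-entropy for the distillation term, and a linear ramp-up; the text only states that the weight ramps from zero to one, so the ramp shape, the parameter `ramp_steps`, and all function names are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def supervised_relation_loss(logits, gold_index):
    """Cross-entropy against the labelled relation."""
    return -log_softmax(logits)[gold_index]


def distillation_loss(student_logits, teacher_logits):
    """Soft-label cross-entropy; teacher probabilities act as constants,
    which plays the role of the stop-gradient in an autograd framework."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_logp = log_softmax(student_logits)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))


def ramp_up(step, ramp_steps):
    """Linear ramp from 0 to 1 (assumed shape; ramp_steps must be positive)."""
    return min(1.0, max(0.0, step / ramp_steps))


def relation_loss(sup, distill, step, ramp_steps):
    """Supervised loss plus the ramp-weighted self-distillation loss."""
    return sup + ramp_up(step, ramp_steps) * distill
```

Early in training `ramp_up` is close to zero, so unreliable teacher predictions contribute little; once the ramp is complete both terms are weighted equally.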

3.2.2 Dense Retrieval Pre-Training

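The goal of dense retrieval pre-training is to learn a question encoder and a passage encoder that map questions and all passages in the Wiki corpus $\mathcal{C}$ into an embedding space, such that each question $q$ is close to its ground-truth positive context passage $p^+$ in the embedding space. The objective is as follows:

$$\mathcal{L}_{retrieve} = -\log \frac{\exp\big(\text{sim}(q, p^+)\big)}{\sum_{p' \in \mathcal{C}} \exp\big(\text{sim}(q, p')\big)} \qquad (1)$$

where $\text{sim}(q, p)$ is the cosine similarity between the normalized embeddings of question and passage.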

Two-Level Negative Passage Sampling.

As we cannot enumerate all other passages in the denominator of Eq. (1), we need to sample a set of negative passages for contrastive learning. Previous studies DBLP:journals/corr/abs-2004-04906 have shown that the sampled negative passages must be hard enough to train the retriever well. As the question and passage embeddings are encoded independently, DPR can efficiently calculate the similarity of each question to all passages in the batch via dot product. Based on this property, as long as the passages within a batch are similar to each other, they serve as hard negative passages for one another. We thus propose the following two-level negative passage sampling strategy to construct hard cases for training the retriever.

We first sample at the level of entities. Given a set of randomly sampled seed entities, we run random walks from these seeds over the Grounded Relational Wiki-Graph to get $B$ entities. As connected entities are related, their true passages are also semantically similar, and thus serve as good negative samples for each other. We then sample at the level of passages. For each source entity $s$ with positive passage $p^+$, we randomly pick $K$ other passages from the same Wiki-page to form a negative passage set. These negative passages are similar to $p^+$, as they all describe the same entity $s$.
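
A minimal sketch of the two sampling levels is given below. The adjacency-list graph layout, the step limit, and the function names are assumptions made for illustration; the random walk here simply expands from already-selected entities until the batch is full.

```python
import random


def random_walk_entities(graph, seeds, target_size, rng):
    """Entity level: grow a batch of related entities from seed entities.

    graph: dict entity -> list of neighbouring entities (hypothetical layout).
    """
    batch = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    if not batch:
        return []
    seen = set(batch)
    steps, max_steps = 0, 100 * target_size  # guard against small components
    while len(batch) < target_size and steps < max_steps:
        steps += 1
        current = rng.choice(batch)
        neighbours = graph.get(current, [])
        if not neighbours:
            continue
        candidate = rng.choice(neighbours)
        if candidate not in seen:
            seen.add(candidate)
            batch.append(candidate)
    return batch


def sample_page_negatives(page_passages, positive, k, rng):
    """Passage level: pick up to k other passages from the same Wiki-page."""
    candidates = [p for p in page_passages if p != positive]
    return rng.sample(candidates, min(k, len(candidates)))


# Toy usage.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
entities = random_walk_entities(toy_graph, ["a"], 3, random.Random(0))
negatives = sample_page_negatives(["p1", "p2", "p3"], "p1", 2, random.Random(0))
```

Because every entity in the batch is reachable from a seed, the positive passage of one entity tends to be topically close to the others and acts as a hard in-batch negative.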

After we collect both the positive and negative passages for all the entities, we use the passage encoder to get a passage embedding matrix with dimension $B(K+1) \times d$. We also use the question encoder to get a question embedding matrix with dimension $B \times d$. We then get a similarity matrix with dimension $B \times B(K+1)$, in which, for each question, the entry for its own positive passage holds the similarity to be maximized and all other entries in the row act as negatives. We thus calculate the retrieval loss by applying the softmax objective of Eq. (1) to each row of this matrix, with the in-batch passages replacing the full corpus in the denominator.
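
The in-batch computation described above can be sketched in plain Python as follows. It assumes cosine similarity without temperature scaling and averages the per-question losses; `positive_index` and the function names are illustrative, and a practical implementation would use batched matrix operations on a GPU.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_retrieval_loss(question_embs, passage_embs, positive_index):
    """Mean negative log-likelihood of each question's positive passage.

    question_embs:  B question vectors.
    passage_embs:   all positive and negative passage vectors in the batch.
    positive_index: positive_index[i] is the row of question i's positive.
    """
    total = 0.0
    for q, pos in zip(question_embs, positive_index):
        sims = [cosine(q, p) for p in passage_embs]  # one similarity row
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += -(sims[pos] - log_z)
    return total / len(question_embs)


# Toy batch: two questions, each aligned with one of two orthogonal passages.
questions = [[1.0, 0.0], [0.0, 1.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = in_batch_retrieval_loss(questions, passages, [0, 1])
loss_swapped = in_batch_retrieval_loss(questions, passages, [1, 0])
```

In the toy batch, the loss is lower when each question is paired with the passage it is actually similar to, which is the behaviour the contrastive objective rewards.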

Masking Source Entity.

As the true passage might contain the name of the source entity $s$, we mask out all the tokens of $s$ from the extracted passages, so that the model must understand the passages to retrieve correctly instead of exploiting a shortcut.

3.2.3 Reading Comprehension Pre-Training

The goal of reading comprehension pre-training is to get a neural reader that re-ranks the top-$k$ retrieved passages and extracts an answer span from each passage. The probability that a passage contains the target answer $t$, and the probabilities of each token in the selected passage being the start or end position of an answer span, are each computed from the reader's encoder outputs by linear projection layers $L$ with different parameters. Note that the re-ranking module adopts cross-attention over questions and passages rather than the dot product of two independently encoded embeddings used in the retriever. For each QA pair, we select other passages in the Wiki-page of entity $s$ as negative passages, and maximize the probability of selecting the positive passage $p^+$. Then, we calculate the start and end probabilities and maximize the probability of the ground-truth span of the target answer $t$. Combining the passage re-ranking and span extraction objectives, we get the reading-comprehension loss.

4 Experiments

In this section, we evaluate RGPT-QA on three open-domain QA datasets: Natural Questions (NQ), Trivia QA and Web Questions (WQ).

4.1 Experiment Settings

We follow the pre-processing procedure described in DPR DBLP:journals/corr/abs-2004-04906 for a fair comparison. We use the English Wikipedia dump from Dec. 20, 2018 and split each article into disjoint passages of 100 words as the corpus. For each question in the three datasets, we use a passage from the processed Wikipedia that contains the answer as the positive passage. We evaluate the QA system by Exact Match (EM) accuracy against the correct answer.
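
The two pre-processing and evaluation steps above can be sketched as follows. Whitespace tokenization and the answer normalization (lower-casing, removing punctuation and English articles, as is common in SQuAD-style evaluation) are assumptions here; the exact rules used by DPR may differ, and the function names are illustrative.

```python
import re
import string


def split_into_passages(text, words_per_passage=100):
    """Split an article into disjoint passages of a fixed number of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def normalize_answer(answer):
    """Lower-case, drop punctuation and articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match(prediction, gold_answers):
    """True if the prediction matches any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_accuracy(predictions, gold_answer_lists):
    """Percentage of questions whose prediction is an exact match."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_answer_lists))
    return 100.0 * hits / len(predictions)
```

For example, `exact_match("The Davidson College.", ["Davidson College"])` evaluates to `True` under this normalization, while a different entity name would not match.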

Our RGPT-QA could be integrated with any open-domain QA system. In this paper, we incorporate it with the recently developed QA system, Dense Passage Retriever (DPR) DBLP:journals/corr/abs-2004-04906 to evaluate our pre-training framework. The DPR model uses the RoBERTa-base (d=768, l=12) model as the base encoder. We first pre-train the retriever and reader in DPR using RGPT-QA. For retriever, we use the negative passage sampling strategy (c.f. Sec. 3.2.2), with initial entity size set to be 12, batch size of 128 and the hard negative passage number of 2. For reader, we randomly sample 64 source entities per batch to calculate the loss. For each entity, we sample 2 hard negative passages for re-ranking. We pre-train both the retriever and reader for 20 epochs using AdamW optimizer and a learning rate warm-up followed by linear decay. Pre-training is run on 8 Tesla V100 GPUs for two days. After the pre-training, we fine-tune the retriever and reader on each QA dataset following the same procedure and hyper-parameters described in DPR DBLP:journals/corr/abs-2004-04906.
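
The learning-rate schedule mentioned above (warm-up followed by linear decay) can be written as a small function. The peak learning rate and the number of warm-up steps are not given in the text, so the parameters below are placeholders, and the function name is illustrative.

```python
def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Placeholder values chosen only to show the shape of the schedule.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=2e-5) for s in range(101)]
```

The schedule rises linearly during warm-up, peaks at the end of warm-up, and then falls linearly to zero at the final step.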

| Block | QA System | Pre-Training Task for QA | NQ (58.9k/3.6k) | Trivia QA (60.4k/11.3k) | WQ (2.5k/2k) |
|---|---|---|---|---|---|
| Supervised | BM25+BERT DBLP:conf/acl/LeeCT19 | - | 26.5 | 47.1 | 17.7 |
| Supervised | HardEM DBLP:conf/emnlp/MinCHZ19 | - | 28.1 | 50.9 | - |
| Supervised | GraphRetriever DBLP:journals/corr/abs-1911-03868 | - | 34.5 | 56.0 | 36.4 |
| Supervised | PathRetriever DBLP:conf/iclr/AsaiHHSX20 | - | 32.6 | - | - |
| Supervised | DPR DBLP:journals/corr/abs-2004-04906 | - | 41.5 | 56.8 | 34.6 |
| Pre-Trained for QA | T5 (large) DBLP:journals/jmlr/RaffelSRLNMZLL20 | T5 (Multitask) | 29.8 | - | 32.2 |
| Pre-Trained for QA | ORQA DBLP:conf/acl/LeeCT19 | ICT | 33.3 | 45.0 | 36.4 |
| Pre-Trained for QA | REALM (Wiki) DBLP:journals/corr/abs-2002-08909 | REALM | 39.2 | - | 40.2 |
| Pre-Trained for QA | REALM (News) DBLP:journals/corr/abs-2002-08909 | REALM | 40.4 | - | 40.7 |
| Pre-Trained for QA | DPR (KnowBERT DBLP:conf/emnlp/PetersNLSJSS19) | Entity Linking | 39.1 | 56.4 | 34.8 |
| Pre-Trained for QA | DPR (KEPLER DBLP:journals/corr/abs-1911-06136) | TransE | 40.9 | 57.1 | 35.2 |
| Pre-Trained for QA | DPR (Unsup.QA DBLP:conf/acl/LewisDR19) | Cloze Translation | 41.9 | 57.3 | 36.5 |
| Pre-Trained for QA | Ours, DPR (RGPT-QA) | RGPT-QA | 43.7 | 59.2 | 40.9 |

Table 2: End-to-end QA Exact Match accuracy (%) on the test sets of three open-domain QA datasets, with the number of train/test examples shown in parentheses. All the results except the last four rows are copied from the original papers. “-” denotes that no result is available. Models in the first block are initialized by BERT/RoBERTa and then directly fine-tuned on the supervised QA datasets. Models in the second block are initialized by RoBERTa, first tuned on a QA pre-training task, and then fine-tuned on the supervised QA datasets.

QA Pre-Training Baselines.

We compare RGPT-QA with three recently proposed pre-training methods for open-domain QA.

T5 DBLP:journals/jmlr/RaffelSRLNMZLL20 adopts multiple generative tasks to pre-train a generative model. The fine-tuned QA models directly generate answers without needing an additional retrieval step.

ORQA DBLP:conf/acl/LeeCT19 adopts an Inverse Cloze Task (ICT) to pre-train the retriever, which forces each sentence’s embedding to be close to that of its context sentences.

REALM DBLP:journals/corr/abs-2002-08909 incorporates a retriever as a module into language model and trains the whole model over masked entity spans.

We directly report the results listed in their papers as they follow the same experiment settings.

We also add two knowledge-guided language models as baselines. Though not targeted at QA problem, these two methods are both designed to capture structured knowledge.

KnowBERT DBLP:conf/emnlp/PetersNLSJSS19 adds entity embedding to each entity mention in text, and adopts the entity linking objective to pre-train the model.

KEPLER DBLP:journals/corr/abs-1911-06136 uses Knowledge Embedding objective, i.e., TransE, to guide embedding encoded over entity description.

We initialize the DPR base encoders with the released pre-trained models of these two works, and then fine-tune on each QA dataset with the same procedure.

We also add Unsupervised Question Answering (Unsup.QA) DBLP:conf/acl/LewisDR19 as a baseline. For each entity taken as the answer, Unsup.QA selects a passage containing the entity as the context passage and derives a cloze question from it. The cloze question is then rewritten into natural language by a machine translator. We use the generated QA dataset to pre-train both the retriever and the reader of the DPR framework.

4.2 Experimental Results

| Pre-Train Model | NQ | Trivia QA | WQ |
|---|---|---|---|
| RoBERTa | 78.4 / 63.3 | 79.4 / 72.6 | 73.2 / 58.1 |
| KnowBERT | 76.7 / 62.6 | 78.9 / 72.2 | 73.4 / 58.3 |
| KEPLER | 77.9 / 62.8 | 79.7 / 72.9 | 74.5 / 58.6 |
| Unsup.QA | 78.6 / 63.7 | 79.9 / 73.0 | 74.5 / 59.1 |
| RGPT-QA | 80.1 / 64.8 | 81.2 / 73.7 | 76.7 / 61.0 |

Table 3: Retrieval (left) accuracy over Top-20 results and Reader (right) Exact Match over Golden-Passages on validation sets of three Open-Domain QA datasets.

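| Configuration | NQ | Trivia QA | WQ |
|---|---|---|---|
| Full RGPT-QA | 44.3 | 59.8 | 41.4 |
| w/o Mask | 39.7 | 56.3 | 34.2 |
| w/o NPS | 43.5 | 58.1 | 39.8 |
| w/o unsupervised (self-distillation) relation loss | 43.8 | 59.3 | 40.8 |
| w/o supervised relation loss | 43.1 | 58.5 | 40.0 |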

Table 4: Ablation of RGPT-QA components on validation sets of three Open-Domain QA datasets. Mask: Mask target entity from question and source entity from passage; NPS: Two-level Negative Passage Sampling.
B K NQ Trivia QA WQ
128 2 80.1 81.2 76.6
128 1 79.7 80.8 76.1
64 2 79.6 80.6 75.8
64 1 79.2 80.1 75.3
Table 5: Ablation of batch size and negative sampling for retrieval pre-training. B: Batch Size; K: Number of other passages as negative sample.

Table 2 summarizes the overall EM accuracy of the QA systems on the three datasets. The DPR framework pre-trained by RGPT-QA outperforms all other open-domain QA systems. Compared with DPR without pre-training, RGPT-QA achieves 2.2%, 2.4% and 6.3% absolute improvement in EM accuracy on the three datasets.

Comparing with other pre-training tasks for QA, RGPT-QA outperforms ORQA by 10.4%, 14.2% and 4.5% on the three datasets, and outperforms REALM by 3.3% and 0.2% on NQ and WQ. This demonstrates that the model performance can be enhanced by leveraging relational QA dataset guided by Grounded Relational Wiki-Graph. We provide a detailed analysis in Sec. 4.3.

KnowBERT and KEPLER encode structural knowledge into pre-trained language models. Both models focus on generating meaningful entity embeddings, and are not designed to infer relations between entities for question answering. From the table, KEPLER trained via TransE performs slightly better than KnowBERT trained via entity linking, and RGPT-QA outperforms KEPLER by 2.8%, 2.1%, and 5.7% on the three datasets.

Similar to RGPT-QA, Unsup.QA DBLP:conf/acl/LewisDR19 also generates QA data from Wikipedia. This baseline slightly improves DPR by 0.4%, 0.5%, and 1.9% on the three datasets, while our RGPT-QA outperforms it by 1.8%, 1.9%, and 4.4%. As discussed in Sec. 3.1, one of the main reasons that our graph-based QA generation strategy performs better is that we use two grounded description passages of the same entity pair, taken from different documents, as question and context. This avoids the lexical overlap problem in Unsup.QA and helps the model capture relational facts.

We also show the retrieval and reader performance separately on validation sets in Table 3. Compared with DPR without pre-training, RGPT-QA improves top-20 accuracy of Retriever by 1.7%, 1.8%, and 3.5%, and improves EM accuracy of Reader by 1.5%, 1.1%, and 2.9%. Also, RGPT-QA outperforms all the other pre-training baselines. This shows that RGPT-QA improves both the retrieval and reader steps of open-domain QA.

Ablation Studies.

We then analyze the importance of each model component in RGPT-QA (Table 4). One key strategy is to mask out the target answer from questions and the source entities from passages during retrieval training. This prevents the model from using entity surface forms to find the correct passage and answer. Without the masking strategy, the average EM performance drops by 5.1%. This shows that masking is essential to avoid shortcuts in QA pre-training. Next, we replace the hard negative passage sampling during retrieval pre-training with random batch sampling. The average EM performance drops by 1.4%, showing the importance of hard negative samples. Finally, we study the unsupervised (self-distillation) relation loss and the supervised relation loss. Removing them leads to 0.5% and 1.3% performance drops respectively, which shows the benefit of training the model to explicitly infer the relation from questions.

Another key component is the negative passage sampling for dense retrieval pre-training. We study how the batch size and the number of negative samples influence the performance of the trained retriever. As shown in Table 5, increasing the batch size and the negative sample size improves the performance of the retriever. Even with a small batch size and few negative samples, our pre-training framework still achieves better performance than the baseline without pre-training, showing that our approach is not sensitive to these two hyperparameters.

Figure 4: Few-shot QA experiment. Figure shows EM accuracy in validation set of DPR model with and without RGPT-QA pre-training, fine-tuned with different percentage of data on Natural Questions.
Figure 5: Long-tail relation experiment. EM accuracy of questions in validation set with different relation frequency in training set.
Few-Shot QA Performance.

We analyze the improvement of RGPT-QA when only a few labelled training samples are available. We fine-tune DPR initialized by RGPT-QA on subsets of Natural Questions of different sizes. As shown in Figure 4, RGPT-QA consistently outperforms DPR without pre-training, and the improvement is larger with less data. Specifically, when only 0.5% (594) labelled QA pairs are provided, the DPR pre-trained by RGPT-QA still achieves 26.0% validation EM accuracy, significantly higher than the 9.4% achieved by DPR without pre-training. The results show that RGPT-QA provides a good initialization for QA systems and reduces the need for large human-annotated QA datasets.

4.3 Generalization for long-tail relations.

As pointed out in Section 2, existing QA datasets suffer from high relation bias, and thus a QA model trained on these datasets cannot generalize well to questions with long-tail relations. We thus analyze whether RGPT-QA can remedy this issue. As shown in Figure 5, the performance improvement of RGPT-QA over the supervised baseline is much larger for questions with infrequent relations. Specifically, for all relations that appear fewer than 5 times in the training set, the average EM accuracy of RGPT-QA is 33.3%, significantly higher than the 22.4% achieved by DPR without pre-training. This indicates that our relational QA generation method does improve the performance on QA pairs with long-tail relations. Detailed prediction results are shown in Table 11 in the Appendix.

5 Related Works

Unsupervised QA via Question Generation

To train a QA system without human annotation of QA pairs, Unsupervised QA was proposed by DBLP:conf/acl/LewisDR19 to generate synthetic data for training QA models. DBLP:conf/acl/LewisDR19 synthesize the QA data by: 1) running NER or noun chunkers over randomly sampled English Wikipedia paragraphs to extract answers; 2) treating the paragraphs surrounding the answer as the context; 3) treating the context as a cloze-style question and feeding it into an unsupervised machine translator to generate a natural question. Some follow-up works also apply templates DBLP:conf/acl/FabbriNWNX20 and pre-trained language models DBLP:conf/emnlp/PuriSSPC20 to masked cloze-style questions to produce more human-readable questions. These cloze-style unsupervised QA methods achieve better performance than previous heuristic QA baselines but underperform supervised ones. The main limitation is that the question is generated with the masked context as input, resulting in severe lexical and surface-form overlap with the context. Consequently, the QA model might use lexical patterns as a shortcut to find the answer. To address the problem of context-question lexical overlap, DBLP:conf/naacl/DhingraPR18 assume each article has an introductory paragraph, and use this paragraph to generate answers. DBLP:conf/acl/LiWDWX20 retrieve the documents cited by Wikipedia as context, and DBLP:journals/corr/abs-2010-12623 leverage structured tables to extract key information from the context, with which to synthesize questions.

To tackle the challenges in previous studies, our framework leverages Wikipedia hyperlinks and Wikidata relations as the bridge connecting two entities with linked descriptions. With one description as the question and the other as the context, the question and context are semantically relevant but lexically different, which naturally solves the problem without involving any additional module.

Knowledge-Guided Pre-Training

Recently, researchers have investigated injecting structured knowledge into pre-trained language models. DBLP:conf/acl/ZhangHLJSL19 and DBLP:conf/emnlp/PetersNLSJSS19 propose to add an entity embedding to each entity mention in text, and add an entity linking objective to guide the model to capture structured knowledge. DBLP:journals/corr/abs-1911-06136 encode entity text descriptions as entity embeddings and train them via the TransE objective. Though these works show improvements on several natural language understanding tasks, they are not dedicated to open-domain question answering.

There are also several pre-training studies for QA. For retrieval, DBLP:conf/acl/LeeCT19 propose an inverse cloze task, which treats a random sentence as the query and the surrounding context as ground-truth evidence to train a QA retrieval model. DBLP:conf/icml/GuuLTPC20 propose to explicitly add a retriever module to the language model and train the retriever via language modelling pre-training. For the reader, DBLP:conf/iclr/XiongDWS20 propose a weakly supervised pre-training objective. They construct fake sentences by replacing the entities in a sentence with other entities of the same type, and train the model to discriminate the original sentence from the fake ones. DBLP:journals/corr/abs-2007-00849 incorporate knowledge graph triplets into the language model, so that the model can use the triplets to predict the correct entity. DBLP:journals/corr/abs-2102-07043 extend this work by learning a virtual knowledge base that infers the relation between two co-occurring entities.

Compared with these works, our RGPT-QA mainly differs in: 1) We do not change the base QA model, so the pre-training framework could be applied to any QA systems. 2) We explicitly model the relations between entities, which proves to benefit QA pairs with less frequent relation patterns.

6 Conclusion

In this paper, we propose RGPT-QA, a simple yet effective pre-training framework. We leverage both Wikipedia hyperlinks and Wikidata relation triplets to construct the Grounded Relational Wiki-Graph, based on which we generate a relational QA dataset. We then pre-train a QA model to infer the latent relation from the question and to conduct extractive QA to obtain the target answer entity. RGPT-QA improves the performance of state-of-the-art QA frameworks, especially for questions with long-tail relations.


This work was partially supported by NSF III-1705169, NSF 1937599, DARPA HR00112090027, Okawa Foundation Grant, and Amazon Research Awards.