Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering

08/22/2019
by   Zhiguo Wang, et al.
Amazon

BERT model has been successfully applied to open-domain QA tasks. However, previous work trains BERT by viewing passages corresponding to the same question as independent training instances, which may cause incomparable scores for answers from different passages. To tackle this issue, we propose a multi-passage BERT model to globally normalize answer scores across all passages of the same question, and this change enables our QA model to find better answers by utilizing more passages. In addition, we find that splitting articles into passages with the length of 100 words by sliding window improves performance by 4%. By leveraging a passage ranker to select high-quality passages, multi-passage BERT gains an additional 2%. Experiments on four standard benchmarks show that our multi-passage BERT outperforms all state-of-the-art models on all benchmarks. In particular, on the OpenSQuAD dataset, our model gains 21.4% EM and 21.5% F_1 over all non-BERT models, and 5.8% EM and 6.5% F_1 over BERT-based models.


1 Introduction

The BERT model Devlin et al. (2018) has achieved significant improvements on a variety of NLP tasks. For question answering (QA), it has dominated the leaderboards of several machine reading comprehension (RC) datasets. However, the RC task is only a simplified version of the QA task, where a model only needs to find an answer within a given passage or paragraph. In reality, an open-domain QA system is required to pinpoint answers from a massive article collection, such as Wikipedia or the entire web.

Recent studies directly applied the BERT-RC model to open-domain QA Yang et al. (2019); Nogueira et al. (2018); Alberti et al. (2019). They first leverage a passage retriever to retrieve multiple passages for each question. During training, passages corresponding to the same question are taken as independent training instances. During inference, the BERT-RC model is applied to each passage individually to predict an answer span, and then the highest scoring span is selected as the final answer. Although this method achieves significant improvements on several datasets, several issues remain unaddressed. First, viewing passages of the same question as independent training instances may result in incomparable answer scores across passages. Thus, globally normalizing scores over all passages of the same question Clark and Gardner (2018) may be helpful. Second, previous work defines passages as articles, paragraphs, or sentences, but the question of the proper granularity of passages is still underexplored. Third, a passage ranker for selecting high-quality passages has been shown to be very useful in previous open-domain QA systems Wang et al. (2018a); Lin et al. (2018); Pang et al. (2019), but we do not know whether it is still required with BERT. Fourth, most effective QA and RC models rely heavily on explicit inter-sentence matching between questions and passages Wang and Jiang (2017); Wang et al. (2016); Seo et al. (2017); Wang et al. (2017), whereas BERT only applies self-attention layers over the concatenation of a question-passage pair. It is unclear whether inter-sentence matching still matters for BERT.

To answer these questions, we conduct a series of empirical studies on the OpenSQuAD dataset Rajpurkar et al. (2016); Wang et al. (2018a). Experimental results show that: (1) global normalization makes the QA model more stable while pinpointing answers from a large number of passages; (2) splitting articles into passages with the length of 100 words by sliding window brings a 4% improvement; (3) leveraging a BERT-based passage ranker gives us an extra 2% improvement; and (4) explicit inter-sentence matching is not helpful for BERT. We also compare our model with state-of-the-art models on four standard benchmarks, and our model outperforms all of them on all benchmarks.

2 Model

Open-domain QA systems aim to find an answer for a given question from a massive article collection. Usually, a retriever is leveraged to retrieve K passages P = {P_1, ..., P_K} for a given question Q, where P_i is the i-th passage and both the question and each passage are sequences of words. A QA model computes a score Pr(a | Q, P) for each possible answer span a. We further decompose the answer span prediction into predicting the start and end positions of the answer span, Pr(a | Q, P) = Pr_s(a_s) * Pr_e(a_e), where Pr_s(a_s) and Pr_e(a_e) are the probabilities of a_s and a_e being the start and end positions.

The BERT-RC model assumes the passages in P are independent of each other. The model concatenates the question and each passage into a new sequence "[CLS] Q [SEP] P_i [SEP]", and applies BERT to encode this sequence. The vector representation of each word position from the BERT encoder is then fed into two separate dense layers to predict the probabilities Pr_s and Pr_e Devlin et al. (2018). During training, the log-likelihood of the correct start and end positions for each passage is optimized independently. For passages without any correct answers, we set the start and end positions to 0, which is the position of the first token [CLS]. During inference, the BERT-RC model is applied to each passage individually to predict an answer, and then the highest scoring span is selected as the final answer. If answers from different passages have the same string, they are merged by summing up their scores.
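To make the per-passage inference concrete, here is a minimal sketch (not the authors' code) of BERT-RC-style answer selection, assuming start/end logits for each passage have already been produced by a BERT reading-comprehension head; the helper names and the answer-length cap are ours.

```python
from collections import defaultdict
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def best_span(start_probs, end_probs, max_len=30):
    """Return the highest-scoring span (s, e) with s <= e < s + max_len."""
    best, best_score = (0, 0), -1.0
    for s in range(1, len(start_probs)):                 # position 0 is [CLS] = "no answer"
        for e in range(s, min(s + max_len, len(end_probs))):
            score = start_probs[s] * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

def bert_rc_answer(passages, start_logits, end_logits):
    """BERT-RC inference: normalize each passage independently, pick the best
    span per passage, and merge identical answer strings by summing scores."""
    merged = defaultdict(float)
    for tokens, s_log, e_log in zip(passages, start_logits, end_logits):
        s_probs, e_probs = softmax(s_log), softmax(e_log)  # per-passage softmax
        (s, e), score = best_span(s_probs, e_probs)
        merged[" ".join(tokens[s:e + 1])] += score
    return max(merged.items(), key=lambda kv: kv[1])       # (answer string, score)
```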

Multi-passage BERT: The BERT-RC model normalizes the probability distributions Pr_s and Pr_e for each passage independently, which may cause incomparable answer scores across passages. To tackle this issue, we leverage the global normalization method Clark and Gardner (2018) to normalize answer scores among multiple passages, and dub this model multi-passage BERT. Concretely, all passages of the same question are processed independently as in BERT-RC until the normalization step. Then, a softmax is applied to normalize all word positions from all passages.
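The following sketch illustrates the global normalization step as we read the description above: a single softmax is taken over the start (or end) logits of all word positions from all passages of the same question, instead of a per-passage softmax. Names and shapes are illustrative.

```python
import numpy as np

def global_normalize(logits_per_passage):
    """logits_per_passage: list of 1-D arrays, one per passage of the same question.
    Returns a list of per-passage probability arrays that jointly sum to 1."""
    flat = np.concatenate([np.asarray(l, dtype=float) for l in logits_per_passage])
    flat = flat - flat.max()
    probs = np.exp(flat) / np.exp(flat).sum()          # one softmax over all passages
    # split back into per-passage probability arrays
    splits = np.cumsum([len(l) for l in logits_per_passage])[:-1]
    return np.split(probs, splits)
```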

Passage ranker reranks all retrieved passages and selects a list of high-quality passages for the multi-passage BERT model. We implement the passage ranker as another BERT model, which is similar to multi-passage BERT except that at the output layer it only predicts a single score for each passage based on the vector representation of the first token [CLS]. We also apply a softmax over all passage scores corresponding to the same question, and train the ranker to maximize the log-likelihood of the passages containing the correct answers. Denoting the passage score as Pr(P_i | Q), the score of an answer span a from passage P_i becomes Pr_s(a_s) * Pr_e(a_e) * Pr(P_i | Q).
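A short sketch of how the ranker score could enter the final answer score under our reading of the text, i.e., the span probabilities multiplied by the softmax-normalized passage score Pr(P_i | Q); function names are ours.

```python
import numpy as np

def rank_passages(ranker_logits):
    """Softmax over the [CLS]-based ranker logits of all passages of one question."""
    z = np.asarray(ranker_logits, dtype=float)
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def answer_score(start_prob, end_prob, passage_prob):
    # Final span score: Pr_s(a_s) * Pr_e(a_e) * Pr(P_i | Q)
    return start_prob * end_prob * passage_prob
```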

3 Experiments

Datasets: We experiment on four open-domain QA datasets. (1) OpenSQuAD: question-answer pairs are from SQuAD 1.1 Rajpurkar et al. (2016), but a QA model must find answers from the entire English Wikipedia rather than a given context. Following Chen et al. (2017), we use the 2016-12-21 English Wikipedia dump. 5,000 QA pairs are randomly selected from the original training set as our validation set, and the remaining QA pairs are taken as our new training set. The original development set is used as our test set. (2) TriviaQA: the unfiltered version of TriviaQA Joshi et al. (2017) is used. Following Pang et al. (2019), we randomly hold out 5,000 QA pairs from the original training set as our validation set, and take the remaining pairs as our new training set. The original development set is used as our test set. (3) Quasar-T Dhingra et al. (2017) and (4) SearchQA Dunn et al. (2017) are used with their official splits.

Basic Settings: If not specified otherwise, the pre-trained BERT-base model with default hyper-parameters is used. ElasticSearch with the BM25 algorithm is employed as our retriever for OpenSQuAD; passages for the other datasets come from the corresponding releases. During training, we use the top-10 passages for each question plus all passages (within the top-100 list) containing correct answers. During inference, we use the top-30 passages for each question. Exact Match (EM) and F1 scores Rajpurkar et al. (2016) are used as the evaluation metrics.
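For reference, a minimal sketch of SQuAD-style EM and F1 on a single prediction/gold pair; the official evaluation script additionally lowercases answers and strips articles and punctuation, which this sketch omits.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the predicted string equals the gold string, else 0.0."""
    return float(prediction.strip() == gold.strip())

def f1_score(prediction, gold):
    """Token-overlap F1 between the predicted and gold answer strings."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```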

3.1 Model Analysis

No. Model EM F1
1 Single-sentence 34.8 44.4
2 Length-50 35.5 45.2
3 Length-100 35.7 45.7
4 Length-200 34.8 44.7
5 w/o sliding-window (same as (3)) 35.7 45.7
6 w/ sliding-window 40.4 49.8
7 w/o passage ranker (same as (6)) 40.4 49.8
8 w/ passage ranker 41.3 51.7
9 w/ passage scores 42.8 53.4
10 BERT+QANet 18.3 27.8
11 BERT+QANet (fix BERT) 35.5 45.9
12 BERT+QANet (init. from (11)) 36.2 46.4
Table 1: Results on the validation set of OpenSQuAD.

To answer the questions raised in Section 1, we conduct a series of experiments on the OpenSQuAD dataset and report the validation-set results in Table 1. The multi-passage BERT model is used for these experiments.

Effect of passage granularity: Previous work usually defines passages as articles Chen et al. (2017), paragraphs Yang et al. (2019), or sentences Wang et al. (2018a); Lin et al. (2018). We explore the effect of passage granularity with respect to passage length, i.e., the number of words in each passage. Each article is split into non-overlapping passages of a fixed length. We vary the passage length among {50, 100, 200}, and list the results as models (2), (3), and (4) in Table 1, respectively. Compared to single-sentence passages (model (1)), leveraging fixed-length passages works better, and passages with 100 words work best. Hereafter, we set the passage length to 100 words.

Figure 1: Effect of global normalization.

Effect of sliding window: Splitting articles into non-overlapping passages may force some near-boundary answer spans to lose useful context. To deal with this issue, we split articles into overlapping passages by sliding window. We set the window size to 100 words and the stride to 50 words (half the window size). The result of the sliding-window model is shown as model (6) in Table 1. We can see that this method brings us 4.7% EM and 4.1% F1 improvements. Hereafter, we use the sliding-window method.
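A minimal sketch of the sliding-window splitting described above (window of 100 words, stride of 50); setting the stride equal to the window size recovers the non-overlapping splitting used for models (2)-(4). The function name is ours.

```python
def split_into_passages(article_words, window=100, stride=50):
    """Split a list of words into overlapping fixed-length passages."""
    passages = []
    for start in range(0, len(article_words), stride):
        passages.append(article_words[start:start + window])
        if start + window >= len(article_words):   # last window reaches the article end
            break
    return passages
```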

Models | Quasar-T (EM / F1) | SearchQA (EM / F1) | TriviaQA (EM / F1) | OpenSQuAD (EM / F1)
DrQA Chen et al. (2017) | 37.7 / 44.5 | 41.9 / 48.7 | 32.3 / 38.3 | 29.8 / -
R^3 Wang et al. (2018a) | 35.3 / 41.7 | 49.0 / 55.3 | 47.3 / 53.7 | 29.1 / 37.5
OpenQA Lin et al. (2018) | 42.2 / 49.3 | 58.8 / 64.5 | 48.7 / 56.3 | 28.7 / 36.6
TraCRNet Dehghani et al. (2019) | 43.2 / 54.0 | 52.9 / 65.1 | - / - | - / -
HAS-QA Pang et al. (2019) | 43.2 / 48.9 | 62.7 / 68.7 | 63.6 / 68.9 | - / -
BERT (Large) Nogueira et al. (2018) | - / - | - / 69.1 | - / - | - / -
BERTserini Yang et al. (2019) | - / - | - / - | - / - | 38.6 / 46.1
BERT-RC (Ours) | 49.7 / 56.8 | 63.7 / 68.7 | 61.0 / 66.9 | 45.4 / 52.5
Multi-Passage BERT (Base) | 51.3 / 59.0 | 65.2 / 70.6 | 62.0 / 67.5 | 51.2 / 59.0
Multi-Passage BERT (Large) | 51.1 / 59.1 | 65.1 / 70.7 | 63.7 / 69.2 | 53.0 / 60.9
Table 2: Comparison with state-of-the-art models, where the first group are models without using BERT, the second group are BERT-based models, and the last group are our multi-passage BERT models.

Effect of passage ranker: We plug the passage ranker into the QA pipeline. First, the retriever returns the top-100 passages for each question. Then, the passage ranker is employed to rerank these 100 passages. Finally, multi-passage BERT takes the top-30 reranked passages as input to pinpoint the final answer. We design two models to check the effect of the passage ranker. The first model utilizes the reranked passages but does not use the passage scores, whereas the second model makes use of both the reranked passages and their scores. Results are given in Table 1 as models (8) and (9), respectively. We find that using the reranked passages alone gives us 0.9% EM and 1.0% F1 improvements, and additionally leveraging the passage scores gives us 1.5% EM and 1.7% F1 improvements. Therefore, the passage ranker is useful for the multi-passage BERT model.

Effect of global normalization: We train BERT-RC and multi-passage BERT models using the reranked passages, then evaluate them on varying numbers of input passages. These models are evaluated in two setups: with and without passage scores. F1 scores for BERT-RC with different numbers of passages are shown as the dotted and solid green curves in Figure 1, and F1 scores for our multi-passage BERT model with the same settings are shown as the dotted and solid blue curves. We can see that all models start from the same F1, because multi-passage BERT is equivalent to BERT-RC when using only one passage. As the number of passages increases, the performance of BERT-RC without passage scores drops significantly, which verifies that the answer scores from BERT-RC are incomparable across passages. This issue is alleviated to some extent by leveraging passage scores. On the other hand, the performance of multi-passage BERT without passage scores increases at first and then flattens out once the number of passages exceeds 10. By utilizing passage scores, multi-passage BERT achieves better performance while using more passages. This phenomenon shows the effectiveness of global normalization, which enables the model to find better answers by utilizing more passages.

Does explicit inter-sentence matching matter? Almost all previous state-of-the-art QA and RC models find answers by matching passages with questions, a.k.a. inter-sentence matching Wang and Jiang (2017); Wang et al. (2016); Seo et al. (2017); Wang et al. (2017); Song et al. (2017). The BERT model, however, simply concatenates a passage with a question, differentiating them by the delimiter token [SEP] and different segment ids. Here, we aim to check whether explicit inter-sentence matching still matters for BERT. We employ a shared BERT model to encode a passage and a question individually, and a weighted sum of all BERT layers is used as the final token-level representation of the question or passage, where the weights over the BERT layers are trainable parameters. The passage and question representations are then fed into QANet Yu et al. (2018) to perform inter-sentence matching and predict the final answer. Model (10) in Table 1 shows the result of jointly training the BERT encoder and the QANet model. The result is very poor, likely because the parameters in BERT are catastrophically forgotten while training the QANet model. To tackle this issue, we fix the parameters in BERT and only update the parameters of QANet. The result is listed as model (11). It works better than model (10), but is still worse than multi-passage BERT in model (6). We design another model by starting from model (11) and then jointly fine-tuning the BERT encoder and QANet. Model (12) in Table 1 shows the result. It works better than model (11), but still has a big gap with multi-passage BERT in model (6). Therefore, we conclude that explicit inter-sentence matching is not helpful for multi-passage BERT. One possible reason is that the multi-head self-attention layers in BERT have already captured the inter-sentence matching.
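A minimal PyTorch-style sketch of the "weighted sum of all BERT layers" token representation described above, with one trainable scalar weight per layer; the class and variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Combine per-layer BERT outputs with trainable, softmax-normalized weights."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch, seq_len, hidden] tensors, one per BERT layer
        stacked = torch.stack(layer_outputs, dim=0)            # [L, B, T, H]
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)                        # [B, T, H]
```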

3.2 Comparison with State-of-the-art Models

We evaluate BERT-RC and Multi-passage BERT on four standard benchmarks, where passage scores are leveraged for both models. We build another multi-passage BERT for each dataset by initializing it with the pre-trained BERT-Large model. Experimental results from our models as well as other state-of-the-art models are shown in Table 2, where the first group are open-domain QA models without using the BERT model, the second group are BERT-based models, and the last group are our multi-passage BERT models.

From Table 2, we can see that our multi-passage BERT model outperforms all state-of-the-art models across all benchmarks, and it works consistently better than our BERT-RC model, which has the same settings except for the global normalization. In particular, on the OpenSQuAD dataset, our model improves by 21.4% EM and 21.5% F1 over all non-BERT models, and by 5.8% EM and 6.5% F1 over BERT-based models. Leveraging the BERT-Large model makes multi-passage BERT even better on the TriviaQA and OpenSQuAD datasets.

4 Conclusion

We propose a multi-passage BERT model for open-domain QA that globally normalizes answer scores across multiple passages corresponding to the same question. We find two effective techniques to improve the performance of multi-passage BERT: (1) splitting articles into passages with the length of 100 words by sliding window; and (2) leveraging a passage ranker to select high-quality passages. With all these techniques, our multi-passage BERT model outperforms all state-of-the-art models on four standard benchmarks.

In the future, we plan to model the inter-correlation among passages for open-domain question answering Wang et al. (2018b); Song et al. (2018).

References

  • C. Alberti, K. Lee, and M. Collins (2019) A BERT baseline for the Natural Questions. arXiv preprint arXiv:1901.08634. Cited by: §1.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In ACL 2017. Cited by: §3, §3.1, Table 2.
  • C. Clark and M. Gardner (2018) Simple and effective multi-paragraph reading comprehension. In ACL 2018. Cited by: §1, §2.
  • M. Dehghani, H. Azarbonyad, J. Kamps, and M. de Rijke (2019) Learning to transform, combine, and reason in open-domain question answering. In WSDM 2019, pp. 681–689. Cited by: Table 2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.
  • B. Dhingra, K. Mazaitis, and W. W. Cohen (2017) Quasar: datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904. Cited by: §3.
  • M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho (2017) SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179. Cited by: §3.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In ACL 2017. Cited by: §3.
  • Y. Lin, H. Ji, Z. Liu, and M. Sun (2018) Denoising distantly supervised open-domain question answering. In ACL 2018. Cited by: §1, §3.1, Table 2.
  • R. Nogueira, J. Bulian, and M. Ciaramita (2018) Learning to coordinate multiple reinforcement learning agents for diverse query reformulation. arXiv preprint arXiv:1809.10658. Cited by: §1, Table 2.
  • L. Pang, Y. Lan, J. Guo, J. Xu, L. Su, and X. Cheng (2019) HAS-QA: hierarchical answer spans model for open-domain question answering. In AAAI 2019. Cited by: §1, §3, Table 2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP 2016. Cited by: §1, §3.
  • M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In ICLR 2017. Cited by: §1, §3.1.
  • L. Song, Z. Wang, and W. Hamza (2017) A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058. Cited by: §3.1.
  • L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea (2018) Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040. Cited by: §4.
  • S. Wang and J. Jiang (2017) Machine comprehension using match-LSTM and answer pointer. In ICLR 2017. Cited by: §1, §3.1.
  • S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang (2018a) R^3: reinforced ranker-reader for open-domain question answering. In AAAI 2018. Cited by: §1, §3.1, Table 2.
  • S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell (2018b) Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR 2018. Cited by: §4.
  • W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou (2017) Gated self-matching networks for reading comprehension and question answering. In ACL 2017. Cited by: §1, §3.1.
  • Z. Wang, H. Mi, W. Hamza, and R. Florian (2016) Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211. Cited by: §1, §3.1.
  • W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin (2019) End-to-end open-domain question answering with BERTserini. arXiv preprint arXiv:1902.01718. Cited by: §1, §3.1, Table 2.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: combining local convolution with global self-attention for reading comprehension. In ICLR 2018. Cited by: §3.1.