Language model pre-training has attracted wide attention, and fine-tuning pre-trained language models has proven effective for improving many downstream natural language processing tasks. BERT [devlin2018bert] obtained new state-of-the-art results on a broad spectrum of diverse tasks by providing pre-trained deep bidirectional representations that are conditioned on both left and right context in all layers. Pre-training is typically followed by discriminative fine-tuning on each specific task, including passage re-ranking for open-domain QA.
There are two limitations of using fine-tuned BERT models for re-ranking passages in QA. Firstly, passages are of variable lengths, which affects the quality of BERT-based representations. Specifically, in the fine-tuning regime of BERT for open domain QA and passage re-ranking, a representation is learnt for the entire passage given a question. While this is desirable for small passages or questions that have short and easy answers, it isn’t for instances where the passage answers a question using multiple, more complex statements. Secondly, the passage re-ranking task is unlike other QA tasks, like factoid QA and reading comprehension, in that the answers are not limited to a word, phrase or sentence. Potential answers can have varying granularity and passages are judged by annotators based on the likelihood of containing the relevant answer. Therefore, the applicability of vanilla BERT models to answering queries that span multiple sentences or might need reasoning across distant sentences in the same passage is limited.
In this paper we deal with the above problems by extending the BERT model to explicitly model sentence representations. This is realized in two steps: first, we distill sentence representations from the output of the BERT block by aggregating the representations of the tokens that make up each sentence. Second, given the sentence representations, we apply a Dynamic Memory Network [kumar2016ask, xiong2016dynamic] to model sentence-wise relations for relevance estimation. We are interested in the following research questions:
By aggregating BERT representations on a sentence level and then reasoning over sentence representations, can we improve re-ranking performance?
Can we improve training efficiency by light-weight reasoning instead of fine-tuning all parameters of BERT?
We perform experimentation on three diverse open-domain QA datasets and show that the sentence-level representations improve the model’s re-ranking performance. We find that explicit sentence modeling using a DMN enables us to reason about the answers that spread across sentences. Additionally, we find that BERT-DMN, although being an extension of BERT, can be used without expensive fine-tuning of the BERT model, resulting in reduced training times. The code will be made publicly available.
2 Related Work
Recent practices in open-domain question answering (QA) can be traced to the Text Retrieval Conferences (TRECs) in the late 1990s. Voorhees99thetrec-8 defines the task of textual open-domain question answering as answering a question using a small text snippet, usually an excerpt from a document in a large collection, such as a set of web pages [kwok2001scaling]. In the last decade, the focus in open-domain question answering has shifted to the re-ranking stage, where answer identification from candidate documents is performed using learning strategies based on richer and better language understanding models [tan2016improved, tran2018multihop, wang2018evidence, wang2018r, lin2018denoising]. Our approach likewise aims to improve the re-ranking part of the QA pipeline. Specifically, we differ from approaches that perform end-to-end question answering, which require some type of term-based retrieval technique to restrict the input text under consideration [chen2017reading, wang2017joint, kratzwald-feuerriegel-2018-adaptive].
Multiple approaches have been proposed to improve re-ranking in open-domain QA. In [tan2016improved], the authors use LSTMs to encode questions and answers and then perform attention- and CNN-based pooling for question-answer matching; [tran2018multihop] follows a similar idea, but produces multiple vector representations for each question and answer, which can focus on different aspects. Other works, like [lin2018denoising], aim to improve the answer selection process by filtering out noisy, irrelevant paragraphs; the answer is then selected from the remaining, relevant paragraphs. Some works use evidence aggregation to re-rank passages based on information from multiple other passages [wang2018evidence] or reinforcement learning to jointly train a ranking model to rank the passages and a reading model to extract the answer from a passage [wang2018r]. In [xu2019passage], the authors use weak supervision to train a BERT-based passage ranking model without any ground-truth labels.
The most recent improvements in the re-ranking stage of open-domain QA come from BERT models, which have been shown to improve language understanding. Recent works have used BERT-based ranking models, dealing with efficiency [guo2020detext] and analyzing the attention mechanism [zhan2020analysis]. In [DBLP:conf/rep4nlp/PetersRS19], the authors compare the performance of BERT with and without fine-tuning on various NLP tasks. [macavaney2019contextualized:bertir] deals with combining traditional ranking models with BERT token representations.
Neural architectures for document ranking can be roughly categorized into representation-based models for learning semantic representations of the text [Shen2014a, dssm13, Shen2014b], interaction-based models for learning salient interaction patterns from the local interactions between the query and document [KNRM17, Guo2016], or a combination of both [mitra2017learning]. Other works [matchpyramid16, Nie_ictir18, Nie_sigir_2018] try to capture hierarchical matching patterns based on n-gram matches from the local interaction matrix of the query-document pair. More recent approaches [pacrr17, co_pacrr_wsdm18, pacrr_drmm_18] have tried to exploit positional information and context of the query terms. Other approaches include query modeling techniques [diaz16, Zamani_16a] with a query expansion based language model (QLM) that uses word embeddings.
The usual question answering process consists of multiple stages. Given a query, a simple method (like BM25) is used to rank a number of passages with respect to the query. Next, the top- of these passages are re-ranked using a more expensive model. Finally, the top- () of the re-ranked passages are used to answer the query. This work deals with the passage re-ranking step.
BERT-based models have achieved high performance in passage re-ranking tasks. We find, however, that these models are limited: Firstly, most variants rely solely on BERT's dedicated classification output, implicitly assuming that all relevant query and passage information can be optimally compressed into this single vector. Secondly, BERT models are very large, which results in slow training.
In this paper we introduce a re-ranking approach that leverages the representations obtained from BERT and aggregates them using a Dynamic Memory Network. We describe DMNs and outline how they can be combined with BERT such that, in addition to the classification output, the query and passage representations are taken into account. Moreover, we investigate how our model can reduce training time by introducing a lite version.
3.1 Dynamic Memory Networks
In this section we briefly introduce Dynamic Memory Networks [kumar2016ask, xiong2016dynamic], which we use to aggregate BERT outputs. DMNs take as input a sequence of words, usually representing multiple sentences such as a document or a passage, and a question. They are composed of four modules (cf. Figure 1):
The input module encodes the input words as a sequence of vector representations. The input text is represented by pre-trained word embeddings and fed into a word-level many-to-many GRU. The outputs are then used as inputs in other modules. If the input consists of a single sentence, each of the GRU outputs is used; however, if the input consists of multiple sentences, only those GRU outputs whose input token is an end-of-sentence token (for example a period or question mark) are used, while the rest are discarded. We denote the final sequence of vectors produced by the input module as the facts $f_1, \dots, f_T$.
The question module is similar to the input module: it encodes the query (or question) as a fixed-size vector representation. The word embeddings are fed into a many-to-one GRU, whose final output is used as the query representation $q$.
The episodic memory module maintains a number of episodes. Episode $i$ produces a memory $m^i$ by iterating a GRU over the fact representations $f_1, \dots, f_T$ from the input module, while taking the previous memory $m^{i-1}$ into account. For this, the GRU's update gate is replaced by a special attention gate $g^i_t$ at each time step:
$$h^i_t = g^i_t \, \tilde{h}^i_t + (1 - g^i_t) \, h^i_{t-1},$$
where $\tilde{h}^i_t$ is the candidate hidden state for the GRU's internal update at time step $t$. The attention gate is a function of the input $f_t$ and the memory and question vectors $m^{i-1}$ and $q$, encoding their similarities (details can be found in [kumar2016ask]). The initial memory is initialized as $m^0 = q$. The candidate hidden state of an episode is computed as $\tilde{h}^i_t = \mathrm{GRU}([f_t; m^{i-1}], h^i_{t-1})$, where $f_t$ is a candidate fact and $[\cdot \, ; \cdot]$ denotes concatenation. The new memory value is then simply set to the last hidden state of the episode, i.e. $m^i = h^i_T$. Finally, the output of the episodic memory module is the last output of a GRU that iterates over all memories $m^1, \dots, m^E$.
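The episodic update described above can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: the attention gate here is a plain softmax over fact-memory and fact-question similarities rather than the full gating network of [kumar2016ask], and the GRU candidate state is a toy stand-in.

```python
import numpy as np

def gru_candidate(x, h, d):
    # Stand-in for a GRU candidate state: a simple tanh mixing of input and state.
    # (A real implementation would use learned update/reset gates and weights.)
    return np.tanh(x[:d] + h)

def episode(facts, memory, question, d):
    """One episode: an attention-gated GRU pass over the facts."""
    # Attention gate: similarity of each fact to the current memory and question.
    scores = np.array([f @ memory + f @ question for f in facts])
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()
    h = np.zeros(d)
    for f, g in zip(facts, gates):
        x = np.concatenate([f, memory])          # [f_t ; m^{i-1}]
        h_tilde = gru_candidate(x, h, d)         # candidate hidden state
        h = g * h_tilde + (1 - g) * h            # gated update
    return h  # new memory = last hidden state of the episode

d = 4
facts = [np.random.rand(d) for _ in range(3)]
question = np.random.rand(d)
memory = question.copy()  # m^0 = q
for _ in range(2):        # two episodes
    memory = episode(facts, memory, question, d)
```

A full DMN would additionally run a GRU over the episode memories to produce the module's output.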
The answer module generates the final output of the model and is therefore highly dependent on the task. In our case, it is a simple feed-forward layer that predicts a score to rank passages given the output of the episodic memory module.
3.2 Combining BERT and DMN
Dynamic Memory Networks have proven to be effective in QA tasks such as reading comprehension. In this paper we combine a DMN with contextualized representations, specifically the outputs of BERT, by modifying the input, question and answer module. The resulting model, BERT-DMN, takes all outputs of BERT into account (including the classification token). It processes the token-level outputs by creating query and sentence representations and reasons over them. In the final step, everything is combined to produce the final query-passage score, which is then used to rank the documents. Figure 2 shows the architecture of our approach.
3.2.1 Input and Question Module
Let the query and passage again be denoted by $q$ and $p$. We first construct the input for BERT as
$$[\mathrm{CLS}] \; q_1 \ldots q_n \; [\mathrm{SEP}] \; p_1 \ldots p_m \; [\mathrm{SEP}].$$
Note that $q_i$ and $p_j$ are not necessarily words, as BERT uses subword tokenization. This input format is identical to the usual way BERT is used: the first input is a classification token, followed by the two text inputs, which are separated by separator tokens.
We split the BERT output back into two chunks, where one corresponds to the query and the other one to the passage. The outputs corresponding to the [SEP] tokens are discarded. We then use the token representations output by BERT as a replacement for the word embeddings in the DMN. In practice, instead of simply using the vector corresponding to the end-of-sentence token to represent the whole sentence, we take the vectors of all tokens in this sentence and average them.
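The sentence-averaging step can be sketched as follows. This is an illustrative NumPy snippet, not the paper's code; sentence boundaries are simplified to punctuation tokens in a flat token list.

```python
import numpy as np

def sentence_representations(tokens, vectors):
    """Average the token vectors of each sentence (split on end-of-sentence tokens)."""
    sentences, current = [], []
    for tok, vec in zip(tokens, vectors):
        current.append(vec)
        if tok in {".", "?", "!"}:  # end-of-sentence token
            sentences.append(np.mean(current, axis=0))
            current = []
    if current:  # trailing tokens without punctuation form a final sentence
        sentences.append(np.mean(current, axis=0))
    return np.stack(sentences)

tokens = ["the", "sky", "is", "blue", ".", "it", "rains", "."]
vectors = np.ones((len(tokens), 768))  # stand-in for BERT token outputs
reps = sentence_representations(tokens, vectors)
# reps holds one 768-dimensional vector per sentence
```

These per-sentence vectors replace the word embeddings that the original DMN input module would consume.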
3.2.2 Answer Module
Since the original DMN model was used for reading comprehension tasks, the answer module consisted of a sequence generation network. For the re-ranking task we are only interested in predicting a score, so we modify the answer module: Let $m$ be the final memory vector, $q$ the query representation and $c$ the BERT output corresponding to the [CLS] token. We concatenate these vectors and compute the final score using a feed-forward layer, i.e.
$$s = \sigma\left(W [m; q; c] + b\right),$$
where $\sigma$ is the sigmoid function and $W$ and $b$ are learnable parameters.
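A minimal sketch of this scoring layer follows; the weights here are random stand-ins for the learned parameters, and the 768-dimensional size is an assumption matching BERT-base outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(memory, query, cls, W, b):
    """Feed-forward scoring over the concatenated [memory; query; CLS] vector."""
    x = np.concatenate([memory, query, cls])
    return sigmoid(W @ x + b)

d = 768
rng = np.random.default_rng(0)
W = rng.normal(size=3 * d) * 0.01  # stand-in for the learned weight vector
b = 0.0
s = score(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d), W, b)
# s is a relevance score in (0, 1), used to rank passages
```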
4 Experimental Setup
In this section we describe the datasets we use for our experiments and the baselines. We further outline the training process.
Table 1: Dataset statistics.

| | ANTIQUE | InsuranceQA | TREC-DL |
|---|---|---|---|
| No. of train queries | 2406 | 12889 | 808731 |
| No. of test queries | 200 | 2000 | 200 |
| Avg. query length | 10.55 | 8.42 | 6.53 |
| No. of passages | 33642 | 27413 | 8841823 |
| Avg. passage length | 47.83 | 103.59 | 64.63 |
| Avg. no. of passages per query | 32.95 | 500 | 1000 |
| Avg. no. of relevant passages | 9.6 | 1.66 | 1.69 |
We conduct experiments on three diverse passage ranking datasets:
ANTIQUE [Hashemi:antique:2019] is a non-factoid question answering benchmark based on the questions and answers of Yahoo! Webscope L6. The questions were filtered to remove those that were too short or duplicates. A resulting sample of question-answer pairs was then judged by crowd workers, who assigned one of four relevance labels to each pair. All questions are grammatically well formed. For the evaluation we follow the authors' recommendation and treat the two higher labels as relevant and the two lower labels as irrelevant.
InsuranceQA [feng2015applying] is a dataset from the insurance domain released in 2015. For this work we use the second version, which comes with predefined train-, dev- and testsets. The dev- and testsets include for each question the relevant answers as well as a pool of irrelevant candidate answers. For our experiments, we choose . All queries and passages in this dataset consist of grammatically well-formed sentences.
TREC-DL is the passage ranking dataset from the TREC deep learning track. It uses MS MARCO [nguyen2016ms], a collection of large Machine Reading Comprehension datasets released by Microsoft in 2016.111http://www.msmarco.org/ This dataset was created using real, anonymized queries from the Bing search engine. The authors automatically identified queries that represented questions and extracted passages from the top- search results. These passages then manually received relevance labels from human annotators. The result is a very large dataset with over 8M passages and 1M queries. However, a number of queries have no associated relevant passages. Because of the nature of this dataset, queries and passages are not guaranteed to be grammatically or structurally correct or even made of complete sentences.
Table 1 outlines some dataset statistics. The evaluation (except for the ANTIQUE testset) follows the telescoping setting [Matveeva06], where a first round of retrieval has already been performed to select candidate passages that are relevant to the queries, followed by a re-ranking step by our models.
Since we are mainly interested in improving the effectiveness and training efficiency of BERT-based models, the most important baseline is a vanilla BERT ranker [nogueira2019passage]. The ranking is solely based on the output corresponding to the classification token, which is transformed into a scalar score using a feed-forward classification layer. Additionally, we implement other neural baselines:
is based on bidirectional LSTMs and attention. Both query and document are encoded using a shared bidirectional many-to-many LSTM, with a pooling operation (maximum or average pooling) applied to the LSTM outputs. Attention scores are computed using the hidden LSTM states of the document and the pooled query representation. The resulting vectors are then compared using cosine similarity after applying dropout. We set the batch size to and the number of LSTM hidden units to . We feed -dimensional pre-trained GloVe [pennington2014glove] embeddings to the shared LSTM and use a dropout rate of .
K-NRM [xiong2017end] is a neural ranking model that works via kernel pooling. Starting from pre-trained word embeddings, it builds a translation matrix, where each row contains the cosine similarities of a query word to all document words. Each row is then fed into kernel functions, and the results are pooled by summation. Finally, a single transformation with tanh activation is applied to output a score. The model is trained with a pairwise ranking loss and uses RBF kernels. We use
-dimensional pre-trained GloVe embeddings to build the translation matrix. The hyperparameters are adopted from [xiong2017end]: We set and use one kernel for exact matches, i.e. and . The remaining kernels are spaced evenly in with , , …, and . We use the Adam optimizer with a learning rate of and and a batch size of .
Dynamic Memory Network [kumar2016ask, xiong2016dynamic] serves (in a slightly modified fashion) as the aggregation part of our model, which transforms sentence-level BERT outputs into a relevance score. We also train this model using pre-trained -dimensional word vectors in order to analyze if and how much BERT representations improve the performance. For these experiments we use the same DMN hyperparameters as in our experiments with BERT-DMN to make the results more comparable.
5.1 Training Efficiency
As previously mentioned, a drawback of BERT-based models is their training inefficiency, as the time required for even a single training epoch can be substantial, albeit a one-time cost. In order to mitigate this, we propose a lite version of our model. While the model architecture remains identical, the BERT layer is excluded from backpropagation, such that its weights remain frozen. This reduces the training time in two ways: the time required to complete the first epoch is slightly lower, as the majority of the weights are excluded from the backward pass; the second and all subsequent epochs can be sped up significantly, as the BERT outputs can be cached and re-used.
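The caching idea can be sketched as follows. The helper names are hypothetical; `bert_forward` stands in for the frozen BERT forward pass.

```python
cache = {}

def encoded(example_id, bert_forward, inputs):
    """Return cached BERT outputs; compute them only on first access.

    With BERT frozen, its outputs for a given input never change, so the
    expensive forward pass is paid once per example instead of once per epoch.
    """
    if example_id not in cache:
        cache[example_id] = bert_forward(inputs)
    return cache[example_id]

# Toy stand-in for the frozen encoder; counts how often it is actually called.
calls = []
def bert_forward(inputs):
    calls.append(inputs)
    return [len(t) for t in inputs]  # placeholder "representations"

for epoch in range(3):
    out = encoded("q1-p7", bert_forward, ["hello", "world"])
# the stand-in encoder runs only once despite three epochs
```

In practice the cache would live on disk or in pinned memory, keyed by query-passage pair, but the access pattern is the same.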
5.2 Training Details
Our models are implemented using PyTorch. We use a pre-trained, uncased model with encoder layers, attention heads and -dimensional vector representations. The training is done as follows: We feed all query-passage pairs through the BERT layer to obtain the token representations. We then compute the average of all vectors for each sentence to obtain the inputs for the GRU, which in turn produces representations that serve as the inputs of the episodic memory module. Similarly, we use another GRU to encode the query as a single vector. In the case of BERT-DMN, the fine-tuning of BERT and training of the DMN happens jointly. For , all weights corresponding to BERT are frozen, i.e. they remain unchanged during the optimization. BERT inputs are truncated if they exceed tokens.
The models are trained using the AdamW optimizer with the learning rate set to , following [nogueira2019passage], and a pairwise max-margin loss: Let $q$ be a query and $p_1$ and $p_2$ passages, where $p_1$ is more relevant to $q$ than $p_2$. The loss is computed as
$$L = \max\left(0, \; m - f(q, p_1) + f(q, p_2)\right),$$
where $m$ is the margin and $f$ is the model. We use and linear warm-up over the first steps ( on TREC-DL). The DMN hyperparameters are set to episodes,
-dimensional hidden representations and a dropout rate of. Dropout is applied at the DMN input, over the attention gates and before the output layer. We use a batch size of throughout our experiments. Validation is performed based on MAP on the devset. We use the same fixed random seed and thus identical training data for all experiments.
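The pairwise max-margin loss described above can be written compactly. This is a sketch; `model` here is any function returning a relevance score, and the toy scorer below is purely illustrative.

```python
def max_margin_loss(model, query, pos_passage, neg_passage, margin=0.2):
    """Hinge loss: the relevant passage should outscore the irrelevant one by `margin`."""
    return max(0.0, margin - model(query, pos_passage) + model(query, neg_passage))

# Toy scorer: relevance as word overlap between query and passage.
def model(q, p):
    return len(set(q.split()) & set(p.split())) / max(len(q.split()), 1)

q = "who wrote hamlet"
loss = max_margin_loss(model, q, "shakespeare wrote hamlet", "the sky is blue")
# zero loss once the positive passage outscores the negative one by the margin
```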
The mean reciprocal rank (MRR) is defined as
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},$$
where $Q$ is the set of all queries and $\mathrm{rank}_i$ refers to the highest rank of any relevant document for the $i$-th query.
Similarly, mean average precision (MAP) is defined as
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|R_q|} \sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k),$$
where $R_q$ is the set of all documents relevant to $q$, $n$ is the total number of retrieved documents, $P(k)$ is the precision at rank $k$ and $\mathrm{rel}(k)$ indicates the relevance of the document at rank $k$.
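Both metrics are straightforward to compute from ranked relevance judgments. In this pure-Python sketch, each element of `runs` is the list of 0/1 relevance labels of one query's ranking, in rank order; here average precision normalizes by the number of relevant documents found in the ranking.

```python
def mrr(ranked_rels):
    """Mean reciprocal rank over queries; ranks are 1-based."""
    total = 0.0
    for rels in ranked_rels:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_rels)

def average_precision(rels):
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / max(hits, 1)

def mean_average_precision(ranked_rels):
    return sum(average_precision(r) for r in ranked_rels) / len(ranked_rels)

runs = [[0, 1, 1], [1, 0, 0]]
print(mrr(runs))  # (1/2 + 1/1) / 2 = 0.75
print(mean_average_precision(runs))
```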
In this section we present and discuss our results.
6.1 Passage Re-Ranking Performance
Table 2 outlines the passage re-ranking performance of our methods and the baselines on three datasets. It is evident that the BERT-based methods vastly outperform the other baselines on all datasets. performs noticeably worse, but still shows improvements over the non-contextual baselines. Finally, BERT-DMN improves the performance of BERT in all but one case. These results yield the following insights:
As expected, the contextual token representations obtained from BERT trump non-contextual word embeddings. Even without any fine-tuning, the BERT representations perform well.
The contextual sentence representations do in fact hold valuable information. This information is discarded by models which only use the output corresponding to the classification token. End-to-end training further improves the performance.
As a result, the DMN profits vastly from BERT representations (), and the performance improves even more when the model is trained end-to-end (BERT-DMN).
6.2 The Effect of Fine-Tuning
In order to analyze the effect of fine-tuning the parameters of BERT, we conduct additional experiments using lite versions of BERT and BERT-DMN. The architectures and hyperparameters of these models are unchanged, however the number of trainable parameters is reduced roughly from 110M to 3M () or 1k () by freezing the BERT model itself. Table 2 shows slight performance drops of in all but one case. However, comparing it to the fine-tuned vanilla BERT model shows even smaller differences, and in some cases the performance increases. Conversely, exhibits a much higher loss of performance over BERT. This indicates that most of the information required for the task is already inherent to the pre-trained BERT model, and fine-tuning its parameters is merely required to direct it towards the desired output (usually the classification token). In order to confirm this hypothesis, we adopt a method proposed by pmlr-v119-goyal20a to measure the diffusion of information within the contextual representations output by BERT: Given a query-passage pair, we use a BERT model to obtain a representation (in our case a -dimensional vector) of each token, corresponding to either query or passage. We then use cosine similarity to compute diffusion of information in three ways:
CLS-Query: Cosine similarity between the classification token and each query token.
CLS-Passage: Cosine similarity between the classification token and each passage token.
Innerpassage: Cosine similarity between each possible pair of two passage tokens.
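These three similarity statistics can be computed directly from the token outputs. The following NumPy sketch assumes the outputs are available as arrays of per-token vectors; the shapes and names are illustrative.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def diffusion(cls_vec, query_vecs, passage_vecs):
    """Average cosine similarities measuring how uniform the BERT outputs are."""
    cls_query = np.mean([cosine(cls_vec, v) for v in query_vecs])
    cls_passage = np.mean([cosine(cls_vec, v) for v in passage_vecs])
    inner = [cosine(a, b)
             for i, a in enumerate(passage_vecs)
             for b in passage_vecs[i + 1:]]  # every unordered passage-token pair
    return cls_query, cls_passage, np.mean(inner)

rng = np.random.default_rng(0)
cls_vec = rng.normal(size=768)
stats = diffusion(cls_vec, rng.normal(size=(4, 768)), rng.normal(size=(6, 768)))
# three averages: CLS-Query, CLS-Passage, Innerpassage
```

High values across the board indicate that the representations have collapsed toward one another, which is the diffusion effect measured in Figure 3.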
The results are illustrated in Figure 3 for three BERT models: one without any fine-tuning (), one with standard fine-tuning using only the classification output (BERT) and finally one fine-tuned as part of our approach (BERT-DMN). These measurements were performed on roughly of the TREC-DL testset (20k query-passage pairs). We observe that, without any fine-tuning, the outputs are rather dissimilar; with standard fine-tuning, however, the similarity of all representations vastly increases, especially within the passages. The same trend is exhibited by the model fine-tuned with BERT-DMN, but to a much lesser extent. This shows that discarding all but one output during fine-tuning leads to very high diffusion, in that all output vectors become very similar, and that taking all outputs into account during fine-tuning alleviates this issue, allowing for a slight performance gain. It further suggests that is able to combine the classification output and the sentence representations, performing closely to a fine-tuned BERT model.
6.3 Training Efficiency
|2.26 5.32||2.55 5.67|
Since the performances of and BERT are comparable (cf. Table 2), can be seen as an alternative to the usual fine-tuning of a BERT model. Since the DMN layer has very few parameters compared to BERT (roughly 3M vs. 100M), the size of the model itself does not change much. However, exhibits noticeable improvements in training efficiency compared to fine-tuning BERT. In order to show this, we measure the number of batches per second for both models in Table 3. For , the first epoch is already slightly faster, as the majority of the weights are excluded from the backward pass; the second and all subsequent epochs are sped up significantly, as the BERT outputs can be cached and re-used for the remainder of the training. The measurements were performed on a single non-shared NVIDIA GTX 1080Ti GPU.
7 Conclusion and Outlook
The exponential growth in the searchable web [holzmann2016dawn] has resulted in the proliferation of numerous knowledge-intensive tasks [holzmann2017exploring, singh2016expedition], of which question answering tasks are prominent [nguyen2016ms_marco, anand2020conversational]. In this paper we introduced BERT-DMN and , extensions of BERT that utilize dynamic memory networks to perform passage re-ranking. We have shown that our model improves the performance of BERT on three datasets. Moreover, performs well even without a fine-tuned BERT model, reducing the training time while incurring only a small performance hit. Our findings demonstrate that fine-tuning BERT-based models is not always necessary, as nearly the same result can be achieved using sentence-level representations.
There are many ways to extend BERT-DMN. Firstly, a common problem of over-parameterized models like BERT is that they are less interpretable. There is some initial work in the direction of understanding the rationale behind QA and passage ranking tasks by either sparsification [zhang2021explain], inspecting BERT’s parametric memory [wallat2020bertnesia], or in a post-hoc manner [zeon2019study, singh2020model:prefcov]. We see the DMN as an interpretable approach to evidence selection for question answering. The dynamic memory module in some sense iteratively computes attention on sentences that reflects their relative importance. We could use this observation to build an interpretable-by-design approach to passage ranking given questions by highlighting evidence sentences from the episodic memory module. Secondly, outside of text datasets, we envision the utility of the DMN in question answering over semi-structured data on the web like anchor text [holzmann2016tempas], semantic annotations [holzmann2017exploring], tables [fetahu2019tablenet]
and fully structured knowledge graphs. Specifically, the transitive reasoning capability is natural to structured information organized as triples or in a graph.