
Robust Domain Adaptation for Machine Reading Comprehension

09/23/2022
by   Liang Jiang, et al.

Most domain adaptation methods for machine reading comprehension (MRC) use a pre-trained question-answer (QA) construction model to generate pseudo QA pairs for MRC transfer. Such a process will inevitably introduce mismatched pairs (i.e., noisy correspondence) due to i) the unavailability of QA pairs in the target documents, and ii) the domain shift when applying the QA construction model to the target domain. Undoubtedly, noisy correspondence degrades the performance of MRC, yet it is neglected by existing works. To solve this untouched problem, we propose to construct QA pairs by additionally using the dialogue related to the documents, together with a new domain adaptation method for MRC. Specifically, we propose the Robust Domain Adaptation for Machine Reading Comprehension (RMRC) method, which consists of an answer extractor (AE), a question selector (QS), and an MRC model. RMRC filters out irrelevant answers by estimating their correlation to the document via the AE, and extracts questions by fusing the candidate questions from multiple rounds of dialogue chats via the QS. With the extracted QA pairs, the MRC model is fine-tuned and provides feedback to optimize the QS through a novel reinforced self-training method. Thanks to the optimization of the QS, our method greatly alleviates the noisy correspondence problem caused by the domain shift. To the best of our knowledge, this could be the first study to reveal the influence of noisy correspondence in domain adaptation for MRC and to show a feasible way to achieve robustness to mismatched pairs. Extensive experiments on three datasets demonstrate the effectiveness of our method.


Introduction

Recently, a number of domain adaptation (DA) methods Cao et al. (2020); Wang et al. (2019); Lewis et al. (2019) for machine reading comprehension (MRC) have been proposed, which usually pre-train an MRC model in a high-resource domain and then transfer it to a low-resource domain. Most existing methods Cao et al. (2020); Wang et al. (2019); Lewis et al. (2019) consist of two steps. First, they construct pseudo QA pairs from the available documents in the target domain using a pre-trained QA construction model. Then, they fine-tune the pre-trained MRC model using the constructed pairs.

Although these methods have achieved promising results, almost all of them ignore mismatched QA pairs, i.e., irrelevant QA pairs that are wrongly treated as positive. In the scenario of MRC, this so-called noisy correspondence (NC) issue Huang et al. (2021) is caused by the following reasons. First, domain adaptation methods for MRC often construct pseudo QA pairs from the target-domain documents, which do not contain natural questions; as a result, the generated questions will probably be irrelevant to the answers. Second, the QA construction model is pre-trained in the source domain and then directly applied to the target domain without fine-tuning; such a domain shift will lead to noisy correspondence. A toy example of NC is shown in Fig. 1(a), and more real-world samples generated by existing works are shown in Fig. 4. Notably, the NC problem is remarkably different from the well-studied noisy labels. To be specific, noisy labels generally refer to category-level annotation errors of a given data point, whereas NC refers to the mismatched relationship between two data points. Undoubtedly, NC will degrade the performance of the MRC model, which has, as far as we know, been neglected so far.

(a) Noisy Labels vs. Noisy Correspondence
(b) Existing Works vs. Our Work
Figure 1: (a) Noisy Labels vs. Noisy Correspondence. Noisy labels refer to errors in the category annotation of data samples caused by human annotation. Noisy correspondence here refers to the misalignment between two data points generated by the model itself. (b) An example of a document along with its dialogue, and the difference between our method and the existing ones. The dialogue consists of the conversations between the questioner and the answerer (here, the customer and customer service) about the document. In other words, the questioner raises a question about the document, and the answerer answers it by referring to the document. Hence, the dialogues contain natural QA-form conversations that are helpful for QA pair construction.

To sum up, domain adaptation methods for MRC face the NC challenge caused by i) the unavailability of QA pairs in the target documents, and ii) the domain shift when applying the QA construction model to the target domain. In this paper, we propose to construct more credible QA pairs by additionally using document-associated dialogues to fill the vacancy of natural question information in the document. In addition, our method fine-tunes the QA construction model in the target domain with the help of the MRC model's feedback. For clarity, we summarize the major differences between our MRC method and the existing ones in Fig. 1(b), taking customer service as a showcase without loss of generality. As shown, in real-world applications, customer service usually answers customers' questions by referring to the documents, forming an associated dialogue. Existing works use only the document for QA construction and do not further fine-tune the QA construction model in the target domain. In contrast, our work leverages both documents and the associated dialogues for QA construction. Dialogues are the conversation corpus between questioners and answerers, which naturally preserves QA-like chats and is thus more credible for QA construction. Moreover, our method uses the feedback of the MRC model on the constructed QA pairs to optimize the QA construction model, thus alleviating the domain shift issue and improving the QA quality.

In practice, however, difficulties arise when attempting to apply the above approach. Although dialogues are more credible than documents for QA construction, they still contain a huge number of irrelevant and discontinuous conversations; in other words, the above QA construction method can partially alleviate, but not fully solve, the NC problem. As shown in Fig. 1(b), an irrelevant conversation is one unrelated to the document, e.g., greetings or other chats that do not concern the document. A discontinuous conversation is one in which the question and answer are not exactly aligned within a single round of chat due to the complexity of the interaction, e.g., the customer may raise a new question before receiving the answer to the last one. Besides the challenges rooted in the dialogue data, another difficulty is how to fine-tune the non-differentiable QA construction model in the target domain to alleviate the domain shift issue. In brief, existing works often generate QA pairs by resorting to a discrete sampling operator, which hinders the optimization of the QA construction model.

To overcome the above challenges in data quality and model optimization, we propose a novel domain adaptation method for MRC, dubbed Robust Domain Adaptation for Machine Reading Comprehension (RMRC), which consists of an answer extractor (AE), a question selector (QS), and an MRC model. In brief, RMRC leverages both documents and the associated dialogues for MRC transfer by i) constructing QA pairs via the AE and the QS, and ii) alternately training the MRC model and the QS. In the first stage, for a given document, RMRC extracts candidate answers from the dialogues and filters out unrelated ones in terms of the relevance estimated by the pre-trained AE, thus alleviating the NC issue caused by irrelevant chats. After that, for each extracted answer, RMRC seeks the most related questions appearing in multiple rounds of chats via the pre-trained QS, thus tackling the NC problem caused by discontinuous conversations. In the second stage, RMRC optimizes the MRC model and the QS in an alternating fashion, which favors MRC transfer. In detail, the domain shift can be alleviated, and accordingly the NC problem tackled, by optimizing the QS with the feedback of the MRC model. Note that, as the QS is non-differentiable and cannot be directly optimized by back-propagation, we propose a novel reinforced self-training optimization method that recasts the model evaluation on the constructed QA pairs as a training reward.

The main contributions and novelty of this paper could be summarized as below.

  • This work could be the first successful attempt to study the NC problem that is common but ignored in existing domain adaptation methods for MRC.

  • To solve the NC problem, we propose to leverage both the document and the associated dialogue for MRC model training. To the best of our knowledge, this is the first study of how to leverage dialogue in domain adaptation for MRC.

  • To implement a robust domain adaptation method for MRC, RMRC consists of the AE, the QS, and an MRC model. Thanks to the reinforced self-training optimization method, the QS can be fine-tuned with the MRC model's feedback, thus further alleviating the influence of NC.

Related Work

In this section, we will briefly introduce some recent developments in machine reading comprehension, the domain adaptation for MRC, and noisy label learning.

Machine Reading Comprehension

Machine reading comprehension aims to answer questions about a given document. Thanks to collected benchmark datasets like SQuAD Rajpurkar et al. (2016), CoQA Reddy et al. (2019), and QuAC Choi et al. (2018), MRC has made great progress in recent years and even surpasses the human level Seo et al. (2016); Yu et al. (2018); Devlin et al. (2018); Zhang et al. (2020); Jiang and Zhao (2018); Gao et al. (2019). Seo et al. (2016) proposed BiDAF, which leverages RNNs and a bi-directional attention mechanism between question and document to achieve promising performance in machine reading comprehension. QANet Yu et al. (2018) used CNNs rather than RNNs to better capture local information in questions and documents. BERT Devlin et al. (2018) uses a large amount of unsupervised corpus for pre-training and successfully improves performance on many downstream NLP tasks, including MRC.

Domain Adaptation

Domain adaptation (DA) is a well-developed technique that aims to transfer knowledge in the presence of a domain gap, i.e., making a model pre-trained in the source domain generalizable to the target domain Zhang et al. (2018, 2021); Ding et al. (2018); Hu et al. (2018); Kan et al. (2015); Li et al. (2013). The existing domain adaptation methods for MRC can be roughly grouped into two categories:

1) Model generalization. These methods aim to improve the generalization capability of a model trained on the source domain to the target domain Li et al. (2019); Su et al. (2019); Baradaran and Amirkhani (2021). For example, Su et al. (2019) propose to use a model pre-trained on multiple MRC datasets simultaneously to improve the generalization ability on new domains. Baradaran and Amirkhani (2021) propose to ensemble models individually trained on different datasets to improve the generalization ability. However, these methods do not leverage the available information in the target domain and thus achieve less satisfactory performance.

2) QA generation. These methods often utilize the target-domain documents to construct QA pairs for fine-tuning the MRC model Du et al. (2017); Wang et al. (2019); Lewis et al. (2019); Cao et al. (2020). For example, Du et al. (2017) propose to generate questions on the target domain using a Seq2seq question generator trained on the source domain, and then fine-tune the MRC model with the pseudo QA pairs. Cao et al. (2020) propose to add a self-training step that filters out low-quality QA pairs using the scores estimated for each constructed pair by the MRC model. Though these methods have made great progress, they still suffer from the noisy correspondence problem discussed in the introduction, leading to sub-optimal performance in real-world applications. Unlike these works, to address the NC problem, this paper proposes to construct more credible QA pairs by additionally using the dialogue, and to optimize the MRC model and the question selector in an alternating fashion to alleviate the domain shift of the QS to the target domain.

Learning with Noisy Labels

To alleviate or even eliminate the influence of noisy labels, many methods have been proposed to achieve robust classification results (Arazo et al., 2019; Song et al., 2020; Liu and Tao, 2015; Xia et al., 2020; Bai et al., 2021; Luo et al., 2021). Currently, existing works often resort to sample selection to achieve noise robustness. To be specific, sample selection methods seek to select the clean samples from the noisy dataset for training. For example, Arazo et al. (2019) propose to treat the samples with small loss as clean samples. Moreover, to further enhance the clean-sample selection capacity, co-teaching methods (Han et al., 2018; Yu et al., 2019) leverage two individually trained networks to filter out noise in an alternating manner. Recently, PES (Bai et al., 2021) proposed a progressive early stopping strategy in the semi-supervised learning framework that treats the clean and noisy samples as labeled and unlabeled data, respectively. In addition, some very recent works Yang et al. (2021); Huang et al. (2021) study the paradigm of noisy correspondence, like this paper. Although these works share some similarity with ours, they differ significantly in motivation, application, and method.

Figure 2: Overview of our method. RMRC consists of an answer extractor (AE), a question selector (QS), and an MRC model. In the first stage, the AE and QS construct QA pairs in the target domain by leveraging both the documents and the dialogues. In the second stage, RMRC optimizes the MRC model and the QS in an alternating fashion.

Method

In this section, we elaborate on the proposed Robust Domain Adaptation for Machine Reading Comprehension (RMRC).

Problem Definition

For a given document and a related question, MRC aims to find a text span from the document as the corresponding answer. To overcome the data scarcity issue in MRC, some domain adaptation methods have been proposed to transfer an MRC model pre-trained in the source domain $\mathcal{S}$ to the target domain $\mathcal{T}$. Different from existing domain adaptation methods for MRC, we utilize the documents along with the associated dialogues in $\mathcal{T}$ to construct QA pairs for model fine-tuning. Formally, given a pre-trained MRC model $\mathcal{M}$ and documents $\{d_i\}$ in the target domain, where each document $d_i$ is associated with a dialogue set $\mathcal{C}_i$ generated by the questioners and answerers, we aim to transfer $\mathcal{M}$ from $\mathcal{S}$ to $\mathcal{T}$. Each dialogue $c \in \mathcal{C}_i$ contains $T$ chats $\{(u_t, r_t)\}_{t=1}^{T}$, where $r_t \in \{q, a\}$ denotes the speaker of the $t$-th chat $u_t$, with $a$ and $q$ indicating the role of answerer and questioner, respectively.

For clarity, we first provide an overview of the proposed RMRC and then introduce its components one by one. As shown in Fig. 2, RMRC consists of an answer extractor (AE), a question selector (QS), and an MRC model, which are applied in the following two stages. 1) QA construction with the AE and QS. For each given document and its associated dialogues, the AE first splits the document into a set of candidate answers. Then, the AE outputs pseudo answers by filtering out irrelevant candidates in terms of their relevance to the corresponding chat. After that, for each pseudo answer and its related chat, the QS selects multiple chats located before the answer-related chat in the dialogue as candidate questions. Finally, the most related candidate questions are concatenated into a pseudo question. 2) Alternating training of the MRC model and the QS. For a given pseudo question and document, RMRC uses the MRC model to obtain the corresponding answer. After that, the obtained answer is used to optimize the MRC model via a cross-entropy loss and the QS via our novel reinforced self-training optimizer.
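As a concrete reference, the two-stage procedure above can be sketched as follows. This is only an illustrative outline: the `ae`, `qs`, and `mrc` objects and their method names are hypothetical placeholders, not the paper's actual interfaces.

```python
def rmrc_transfer(documents, dialogues, ae, qs, mrc, epochs=3):
    """Hypothetical outline of RMRC's two stages: (1) construct pseudo QA
    pairs with the AE and QS, (2) alternately train the MRC model and QS."""
    for _ in range(epochs):
        # Stage 1: QA construction with the AE and QS.
        pairs = []
        for doc, dialogue in zip(documents, dialogues):
            for answer in ae.extract(doc, dialogue):         # filter irrelevant answers
                question = qs.select(answer, doc, dialogue)  # fuse candidate questions
                pairs.append((question, answer, doc))
        # Stage 2: alternating optimization.
        mrc.finetune(pairs)                # cross-entropy span loss (Eq. 10)
        qs.reinforce(pairs, feedback=mrc)  # reinforced self-training (Eq. 14)
    return mrc
```

The key design point is that the MRC model's evaluation of each constructed pair flows back into the QS, so both the pair quality and the reader improve over epochs.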

Answer Extractor

The AE is designed to extract answers by finding the text span of the document most similar to the answerer's responses in the dialogue. Formally, with the document set $\mathcal{D}$ and an associated dialogue, the AE extracts the corresponding answer while filtering out unrelated ones. Specifically, we first split each document $d \in \mathcal{D}$ into candidate answers via

$$\mathcal{A}_d = \mathrm{ngram}(d, N), \tag{1}$$

where $\mathrm{ngram}(\cdot, N)$ is the operator that extracts the candidate answers from $d$, i.e., all $n$-gram token spans with $n \le N$, and $N$ denotes the maximum token number of a candidate answer. With the candidate answers, a text matching function is developed to find the best matched answer to the answerer's chat $u_t$, i.e.,

$$(a^{*}, d^{*}) = \operatorname*{arg\,max}_{d \in \mathcal{D},\ a \in \mathcal{A}_d} \cos\big(\phi(u_t), \phi(a)\big), \tag{2}$$

where $a \in \mathcal{A}_d$ is an extracted $n$-gram span, and $a^{*}$ and $d^{*}$ are the best matched answer and the corresponding document for $u_t$. In the equation, $\cos(\cdot,\cdot)$ is the cosine similarity between the features $\phi(\cdot)$ of $u_t$ and $a$ extracted by a pre-trained BERT Devlin et al. (2018). However, in real-world dialogues the conversation contains not only document-related answers but also unrelated chats such as greetings, as shown in Fig. 1(b). Hence, to filter out these unrelated answers, we keep only the answers whose matching score is larger than a given threshold $\tau$. Formally,

$$a^{*} \text{ is kept} \iff \cos\big(\phi(u_t), \phi(a^{*})\big) > \tau. \tag{3}$$
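A minimal sketch of the AE's candidate extraction and filtering (Eqs. 1-3). A toy bag-of-words embedding stands in for the pre-trained BERT features, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import Counter

def ngram_candidates(tokens, max_n):
    """Eq. 1: enumerate all n-gram spans (n = 1..max_n) of the document."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def embed(text):
    # Toy bag-of-words stand-in for the BERT sentence features of Eq. 2.
    return Counter(text.lower().split())

def best_matched_answer(chat, doc_tokens, max_n=3, tau=0.5):
    """Eqs. 2-3: return the candidate span most similar to the chat,
    or None if the best score falls below the filtering threshold tau."""
    scored = [(cosine(embed(chat), embed(a)), a)
              for a in ngram_candidates(doc_tokens, max_n)]
    score, answer = max(scored)
    return answer if score >= tau else None
```

An answerer chat that paraphrases a document span is matched to that span, while a greeting scores below the threshold and is discarded.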

Question Selector

For a given answer $a^{*}$ extracted from chat $u_t$, we pass it and its associated document $d^{*}$ along with the corresponding chat through the QS to find the corresponding question from the dialogue. In detail, we first obtain the candidate questions by selecting multiple chats located before $u_t$ in the dialogue, i.e.,

$$\mathcal{Q} = \{u_{t'} \mid t - W \le t' < t,\ r_{t'} = q\}, \tag{4}$$

where $\mathcal{Q}$ denotes the closest questioner chats located before $u_t$, and $W$ is the maximal question selection range, which is fixed throughout our experiments. Such a question selection strategy is based on the observation that the corresponding questions only exist before the answers and become less relevant the further they are from the answer. For each $u_{t'} \in \mathcal{Q}$, we then compute the relevance score between the answer and the candidate question by

$$s(u_{t'}) = \sigma\big(g([u_{t'}; a^{*}; d^{*}])\big), \tag{5}$$

where $g(\cdot)$ transforms the concatenated input into a hidden vector for relevance prediction, and $[\cdot;\cdot]$ denotes the concatenation operator. Note that the relevance score can be regarded as the conditional probability of $u_{t'}$ being the corresponding question of the given answer $a^{*}$, i.e., $p(u_{t'} \mid a^{*})$. To further alleviate the aforementioned discontinuous conversation issue in the real world (e.g., the questioner may raise a new question before the last one is answered), we concatenate the $K$ most related questions in terms of the relevance score w.r.t. $a^{*}$, i.e.,

$$q^{*} = \big[\, u_{t'} \mid u_{t'} \in \mathrm{topK}(\mathcal{Q}) \,\big], \tag{6}$$

where $\mathrm{topK}(\mathcal{Q})$ denotes the $K$-nearest neighbors of $a^{*}$ in $\mathcal{Q}$ in terms of the relevance score. Accordingly, the probability of $q^{*}$ being the corresponding question of $a^{*}$ is formulated as

$$p(q^{*} \mid a^{*}) = \prod_{u_{t'} \in \mathrm{topK}(\mathcal{Q})} p(u_{t'} \mid a^{*}). \tag{7}$$
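The QS's selection-and-fusion steps (Eqs. 4-6) can be sketched as below. The `relevance` argument is a placeholder for the BERT-based scorer of Eq. 5, and the dialogue representation is an illustrative assumption.

```python
def candidate_questions(dialogue, answer_idx, window):
    """Eq. 4: the questioner chats among the `window` chats preceding
    the answer-related chat at position `answer_idx`."""
    start = max(0, answer_idx - window)
    return [(i, c["text"])
            for i, c in enumerate(dialogue[start:answer_idx], start)
            if c["role"] == "q"]

def fuse_questions(dialogue, answer_idx, relevance, window=4, k=2):
    """Eqs. 5-6: score each candidate, keep the top-k by relevance,
    and concatenate them (in dialogue order) into one pseudo question."""
    cands = candidate_questions(dialogue, answer_idx, window)
    top = sorted(cands, key=lambda ic: relevance(ic[1]), reverse=True)[:k]
    top.sort(key=lambda ic: ic[0])  # restore chronological order
    return " ".join(text for _, text in top)
```

Fusing several preceding questions is what lets the method recover from discontinuous conversations where the true question is not the immediately preceding chat.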

MRC Model Training

With a constructed pseudo QA pair $(q^{*}, a^{*})$ and the corresponding document $d^{*}$, we fine-tune the pre-trained MRC model to improve its generalization to the target domain. In detail, we first embed the question and the associated document into a hidden space with a BERT encoder, i.e.,

$$\mathbf{H} = \mathrm{BERT}([q^{*}; d^{*}]). \tag{8}$$

Then, $\mathbf{H}$ is utilized to predict the positions of the corresponding answer in the document. Specifically,

$$\mathbf{p}^{s} = \mathrm{softmax}\big(f_{s}(\mathbf{H})\big), \qquad \mathbf{p}^{e} = \mathrm{softmax}\big(f_{e}(\mathbf{H})\big), \tag{9}$$

where $\mathbf{p}^{s}$ and $\mathbf{p}^{e}$ denote the probability of each token being the start and end position of the answer in $d^{*}$, respectively, and $f_{s}$ and $f_{e}$ are two one-layer feed-forward networks with parameters $\mathbf{w}_{s}$ and $\mathbf{w}_{e}$, respectively. Finally, we use the cross entropy between the predicted probabilities and the ground truth as the training loss for our MRC model, i.e.,

$$\mathcal{L}_{\mathrm{MRC}} = -\big(\mathbf{y}^{s\top}\log \mathbf{p}^{s} + \mathbf{y}^{e\top}\log \mathbf{p}^{e}\big), \tag{10}$$

where $\mathbf{y}^{s}$ and $\mathbf{y}^{e}$ are two one-hot vectors denoting the start and end positions of $a^{*}$ in $d^{*}$.
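The span loss of Eqs. 9-10 reduces to a softmax cross-entropy over start and end positions. A minimal numeric sketch, with raw per-token scores standing in for the feed-forward outputs:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def span_loss(start_scores, end_scores, start_idx, end_idx):
    """Eq. 10: cross-entropy of the predicted start/end distributions
    against one-hot ground-truth positions (start_idx, end_idx)."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    return -(math.log(p_start[start_idx]) + math.log(p_end[end_idx]))
```

With uniform scores over n tokens the loss is 2·log(n), and it decreases as probability mass concentrates on the true span boundaries.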

Reinforced Self-training for QS

As the QS is pre-trained on the source domain, noisy correspondence will inevitably be introduced when the QS is applied to the target domain for constructing QA pairs. To address this domain shift issue, we propose to optimize the QS in the target domain by recasting the model evaluation on the constructed pseudo QA pairs as a training reward. Specifically, for a given pair $(q^{*}, a^{*})$, we first obtain the predicted answer $\hat{a}$ by

$$\hat{a} = d^{*}[i_s : i_e], \qquad i_s = \operatorname*{arg\,max}_i \mathbf{p}^{s}_i, \qquad i_e = \operatorname*{arg\,max}_i \mathbf{p}^{e}_i, \tag{11}$$

where $\mathbf{p}^{s}$ and $\mathbf{p}^{e}$ are the model outputs defined in Eq. 9, and $i_s$ and $i_e$ denote the start and end token of the predicted answer, respectively. Using the F1-score as the quality evaluation of the constructed QA pairs, the QS is expected to generate QA pairs with high F1-score, i.e.,

$$\max_{\theta}\ J(\theta) = \mathbb{E}_{q^{*} \sim p_{\theta}(q^{*} \mid a^{*})}\big[\mathrm{F1}(\hat{a}, a^{*})\big], \tag{12}$$

where $\theta$ denotes the parameters of the QS. As shown in Eq. 6, the QA construction adopts selection and concatenation operators via a discrete sampling technique. To overcome this non-differentiability, we adopt policy-gradient-based reinforcement learning (REINFORCE) Williams (1992) to approximate the gradients w.r.t. $\theta$. Specifically, let $J(\theta)$ denote the objective function (Eq. 12); its gradient can be approximated as below:

$$\nabla_{\theta} J(\theta) \approx \mathrm{F1}(\hat{a}, a^{*})\, \nabla_{\theta} \log p_{\theta}(q^{*} \mid a^{*}), \tag{13}$$

where $\hat{a}$ is obtained by passing $q^{*}$ through the MRC model, and $p_{\theta}(q^{*} \mid a^{*})$ is the conditional probability of $q^{*}$ being the corresponding question of $a^{*}$. The details of the gradient approximation are provided in the Supplementary Material. With the approximated gradients, the loss function for the QS can be rewritten as

$$\mathcal{L}_{\mathrm{QS}} = -\big(\mathrm{F1}(\hat{a}, a^{*}) - b\big)\, \log p_{\theta}(q^{*} \mid a^{*}), \tag{14}$$

where $p_{\theta}(q^{*} \mid a^{*})$ is the relevance defined in Eq. 6. As only one sample is used per reward estimation, we use a constant $b$ as the baseline score for reducing variance, as in Williams (1992).
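A minimal sketch of the reward and surrogate loss (Eqs. 12 and 14), assuming a simple whitespace-token F1 and a scalar log-probability from the QS; both simplifications are illustrative.

```python
from collections import Counter

def token_f1(pred, gold):
    """Token-overlap F1 between the predicted and pseudo answers,
    used as the reward signal of Eq. 12."""
    p, g = pred.split(), gold.split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def qs_loss(log_prob_question, predicted_answer, pseudo_answer, baseline=0.5):
    """Eq. 14: REINFORCE surrogate loss. Questions whose MRC prediction
    recovers the pseudo answer (F1 > baseline) are reinforced; the
    baseline term reduces the variance of the single-sample estimate."""
    reward = token_f1(predicted_answer, pseudo_answer) - baseline
    return -reward * log_prob_question
```

Minimizing this loss with a gradient-based optimizer pushes up the log-probability of questions that earn above-baseline rewards and pushes it down otherwise.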

Experiments

In this section, we evaluate RMRC on three datasets by comparing it with three MRC domain adaptation methods. The code and the used datasets will be released soon.

Data    Size   BERT-S       UQACT        CASe         AdaMRC       RMRC-fix     RMRC
               EM    F1     EM    F1     EM    F1     EM    F1     EM    F1     EM    F1
d-QuAC  ALL    3.36  16.36  6.81  19.05  6.79  19.33  5.44  18.87  7.45  23.13  8.27  24.46
        5000   -     -      4.26  17.88  4.85  17.92  4.57  18.01  5.85  21.25  6.37  22.27
        1000   -     -      3.62  16.84  3.91  17.55  4.12  17.34  4.42  19.99  5.67  21.28
d-CoQA  ALL    12.50 37.80  13.92 40.18  17.01 43.27  14.82 39.92  20.85 49.09  20.97 49.35
        5000   -     -      12.86 38.27  14.09 39.52  13.21 38.88  18.30 45.49  17.63 44.94
        1000   -     -      11.45 35.43  12.63 38.92  12.39 38.11  15.43 42.75  16.08 42.27
Table 1: Results on d-QuAC and d-CoQA. Bold values indicate the best performance.
Model                          EM    F1
BERT-S Devlin et al. (2018)    1.81  32.62
UQACT Lewis et al. (2019)      5.32  42.23
CASe Cao et al. (2020)         5.75  44.28
AdaMRC Wang et al. (2019)      7.20  43.11
RMRC-fix                       9.25  52.08
RMRC                           9.28  55.67
Table 2: Results on Alipay dataset.
Model                                EM    F1
BERT-S                               1.81  32.62
RMRC w/o                             8.44  54.13
RMRC w/o Answer Filtering            2.10  35.05
RMRC w/o Question Fusing             8.03  53.71
RMRC w/o QS Training                 8.52  51.72
RMRC w/ Confidence-based Selector    8.76  52.34
RMRC w/ CE Reward                    7.71  55.52
RMRC                                 9.28  55.67
Table 3: Performance of different variants of RMRC.

Datasets

We pre-train the MRC model on the SQuAD dataset Rajpurkar et al. (2016) and fine-tune it on three target datasets: two public datasets (QuAC Choi et al. (2018) and CoQA Reddy et al. (2019)) and one real-world dataset from Alipay. Note that, as the real-world Alipay data is in Chinese, we use another Chinese corpus instead of SQuAD for pre-training the MRC model, i.e., a collection of CMRC Cui et al. (2018), DRCD Shao et al. (2018), and DUREADER He et al. (2017).

QuAC and CoQA with Synthetic Noises: QuAC consists of 12,567 documents, including 11,567 training documents and 1,000 testing documents. Each document is affiliated with a related dialogue. In each round of chat in the dialogue, one user asks a question about the document and the other user answers it. In total, there are 69,109 QA pairs for training and 5,868 for testing. Similar to QuAC, CoQA is composed of a training set of 7,199 documents along with 107,285 QA pairs, and a testing set of 500 documents along with 7,918 QA pairs. As the QA pairs of QuAC and CoQA are well matched in the dialogue, we simulate real conversations with noisy correspondence by randomly shuffling the questions in each dialogue. Specifically, for each dialogue, we randomly move each question ahead of the corresponding answer by up to a fixed number of rounds. In the following, we denote the shuffled versions of QuAC and CoQA as d-QuAC and d-CoQA, respectively.
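The question-shuffling procedure used to build d-QuAC and d-CoQA can be sketched as follows. The chat representation, the fixed seed, and the exact shift sampling are illustrative assumptions, not the paper's exact protocol.

```python
import random

def shuffle_questions(dialogue, max_shift=2, seed=0):
    """Simulate noisy correspondence: move each question between 0 and
    max_shift chats ahead of (i.e., earlier than) its current position,
    so questions no longer align with their answers in the same round.
    `dialogue` is a list of (role, text) chats with role in {"q", "a"}."""
    rng = random.Random(seed)
    out = list(dialogue)
    for chat in dialogue:
        if chat[0] != "q":
            continue
        j = out.index(chat)                      # current position (texts assumed unique)
        k = max(0, j - rng.randint(0, max_shift))
        out.insert(k, out.pop(j))                # move the question earlier
    return out
```

Only questions are moved, so the relative order of the answers (and hence the document-grounded content) is preserved while the question-answer alignment is corrupted.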

Alipay Dataset with Real Noises: The Alipay dataset is collected in real-world scenarios and contains conversations about marketing activities between customers and customer service. In total, the dataset consists of 1,526 dialogues and 3,813 human-annotated QA pairs for testing.

Implementation

In our experiments, we take the widely-used BERT as the base encoder for the QS and the MRC encoder. The BERT network contains 12 hidden layers, each of which consists of 12 attention heads. For all experiments, we generate n-grams for each document by setting for Eq. 1 and set the threshold for answer filtering to and select the question by fixing in Eq. 6 to . The optimal parameters are determined by the grid search in the Alipay dataset and used for all experiments. We set the baseline score of the reward to for Eq. 14 in all experiments. For network training, we use the Adam optimizer whose learning rate is set to and for pre-training and fine-tuning, respectively. More training details are provided in the supplemental material. For evaluations, we take the Exact Match (EM) and F1-score (F1) as the performance measurements. Both the metrics are the higher the better.
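For reference, the two evaluation metrics can be computed as below. These are simplified versions that omit the punctuation and article normalization used by official SQuAD-style evaluation scripts.

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1.0 iff the whitespace- and case-normalized strings are identical."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between a predicted span and the gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)
```

EM only credits perfect spans, while F1 gives partial credit for overlapping spans, which is why the two columns in Tables 1-3 can diverge.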

Figure 3: (a-c) Influence of the answer filtering threshold, the selected question number, and the baseline reward.
Figure 4: Comparison between the questions constructed by our method and by existing work Lewis et al. (2019). The red and green numbers denote negative and positive rewards from the MRC model, respectively.

Comparison Experiments

To show the effectiveness of our method, we compare it with baselines including a vanilla BERT model trained in the source domain (denoted BERT-S) and three QA-generation-based MRC domain adaptation methods, i.e., UQACT Lewis et al. (2019), AdaMRC Wang et al. (2019), and CASe Cao et al. (2020). Note that CASe Cao et al. (2020) uses annotated questions in the target domain for fine-tuning, while such questions are unavailable in our experimental setting. As a remedy, we use a pre-trained question generator to generate questions in the target domain for CASe. All the baselines and our method follow the same training pipeline, i.e., first pre-training the MRC model on the SQuAD data, then fine-tuning and evaluating it on the target datasets (d-QuAC, d-CoQA, and Alipay). For all baselines, we conduct the experiments with the recommended parameters and report the best result. For our method, we repeat the experiments three times with different seeds and report the average. Moreover, we additionally evaluate a variant of RMRC with a fixed QS (denoted RMRC-fix) that keeps the weights of the QS fixed instead of optimizing them with the proposed reinforced self-training optimizer. Such a baseline helps us understand the influence of the proposed reinforced self-training strategy.

Results on d-QuAC and d-CoQA: Table 1 shows the quantitative results on d-QuAC and d-CoQA. From the results, one can observe that all domain adaptation methods outperform BERT-S. Moreover, the proposed methods (i.e., RMRC and RMRC-fix) significantly outperform the other three domain adaptation methods. In particular, RMRC achieves performance gains of 1.48/5.13 and 3.96/6.08 (EM/F1) on d-QuAC-ALL and d-CoQA-ALL over the best baseline, respectively. The comparison between RMRC and RMRC-fix shows the effectiveness of our QS training strategy. Furthermore, to investigate the performance of the proposed method with small training datasets, we conduct experiments on QuAC and CoQA with 5,000 and 1,000 training samples. In other words, only 1.45% of the original QuAC training samples are used in the d-QuAC-1000 evaluation. As shown, as the data size decreases, the performance of all methods decreases, while RMRC still outperforms all baselines.

Results on Alipay Dataset: Table 2 shows the quantitative results on the Alipay dataset. From the results, one can see that RMRC outperforms all baselines by a considerable margin, demonstrating its effectiveness on a real-world dataset. Specifically, RMRC achieves gains of 2.08 and 12.56 over the strongest baseline (AdaMRC) in terms of EM and F1, respectively.

Ablation Study

In this section, we carry out an ablation study on the Alipay dataset. We report the performance when removing or replacing modules, including removing answer filtering (RMRC w/o Answer Filtering), question fusing (RMRC w/o Question Fusing), and QS training (RMRC w/o QS Training); replacing the F1-score reward with a cross-entropy reward (RMRC w/ CE Reward); and replacing the reinforced self-training mechanism with a confidence-based self-training mechanism (RMRC w/ Confidence-based Selector). As shown in Table 3, all the proposed modules are crucial to achieving encouraging results, and the following conclusions can be drawn: i) all RMRC variants outperform BERT-S, demonstrating the value of dialogue-based QA pair construction; ii) RMRC and RMRC w/ CE Reward significantly outperform both RMRC w/o QS Training and RMRC w/ Confidence-based Selector, which shows the effectiveness of the proposed reinforced self-training algorithm for QS training; iii) RMRC using F1 as the reward outperforms the CE variant, as the F1 score directly measures the precision of the predicted answer.

Parameter Sensitivity Analysis

In this section, we carry out experiments on the Alipay dataset to investigate the influence of different parameters in RMRC, including the answer filtering threshold in Eq. 3, the number of nearest questions in Eq. 6, and the baseline reward in Eq. 14. As shown in Fig. 3(a), the performance keeps increasing as the threshold increases, up to the best-performing value. As shown in Fig. 3(b), the performance of the MRC model first keeps increasing as the number of selected questions grows, showing that the correct question is more likely to be selected when more questions are fused; beyond that point, the performance significantly decreases, because too many selected questions inevitably introduce more noisy pairs. As shown in Fig. 3(c), the performance of the MRC model first improves as the baseline reward increases and then decreases once it becomes too large. The reason is that an over-high baseline discourages almost all predictions during QS training, thus providing incorrect feedback to the QS.

Case Study

To investigate the effectiveness of the QS and the reinforced self-training optimizer, we conduct a case study in Fig. 4. As shown, the previous domain adaptation works suffer from the NC problem and thus yield questions irrelevant to the answer. In contrast, RMRC explicitly solves the NC problem by training the QS with the MRC model's feedback. Specifically, in the first epoch, the QS selects an irrelevant question, which receives a negative reward from the MRC model. After several training epochs, the optimized QS finds the desirable question for the answer. Such an example intuitively demonstrates the effectiveness of our QS and the reinforced self-training algorithm.

Conclusion

This paper could be the first successful attempt to solve the noisy correspondence problem in domain adaptation methods for MRC. Different from the well-studied noisy label problem, noisy correspondence refers to errors in alignment rather than in category-level annotation. To overcome this challenge, we propose a robust domain adaptation method for MRC with a novel reinforced self-training optimizer. Extensive experiments verify the effectiveness of the proposed method in leveraging synthesized and real-world dialogue data for MRC.
