During the consultation phase of primary patient care, healthcare professionals raise at least one question for every two patients. Even though they can successfully find answers to 78% of the questions they pursue, they never pursue half of their questions because of time constraints and the suspicion that helpful answers do not exist, notwithstanding the availability of ample evidence [12, 5]. Additionally, searching existing resources for reliable, relevant, and high-quality information is inconvenient for clinicians on account of time limitations. This phenomenon fosters dependency on general-information electronic resources that are simple to use, such as Google. Apart from healthcare professionals, there is also growing public interest in learning about medical conditions online. Nevertheless, the criteria by which general-purpose search engines rank search results do not conform directly to the fundamentals of evidence-based medicine (EBM) and thus lack rigor, reliability, and quality.
While traditional information retrieval (IR) systems somewhat mitigate this issue, a healthcare-information professional still requires four hours to find answers to queries related to complex biomedical resources. Whereas IR systems usually provide users (the general population or healthcare professionals) a group of documents from which to interpret and find the exact answers, biomedical machine reading comprehension (biomedical-MRC) systems can provide exact answers to user inquiries, saving both time and effort.
In natural language processing (NLP), machine reading comprehension (MRC) is a challenging task aiming to teach and evaluate machines to understand user-defined questions, read and comprehend input passages (namely, contexts), and return answers from them. The datasets in the MRC task consist of context-question-answer triplets, where the question-answer pairs are considered labels. With the development and availability of efficient computing hardware, researchers have developed several state-of-the-art (SOTA) neural network-based MRC systems capable of achieving performance analogous or superior to human-level performance on several benchmark MRC datasets [9, 40, 13, 43, 25]. However, this achievement depends heavily on the large, high-quality human-annotated datasets used to train these systems. For domain-specific MRC tasks, especially biomedical-MRC, building a high-quality labeled dataset, specifically the question-answer pairs residing in the dataset, requires considerable effort and subject-matter expertise. This requirement leads to smaller biomedical-MRC datasets and, consequently, poor performance on the MRC task itself. Hence, developing an approach that can effectively leverage unlabeled or small-scale labeled datasets in training the biomedical-MRC model is crucial for improving performance.
Researchers have addressed this issue by using transfer learning, a learning process that helps transfer knowledge from a source domain to a target domain. In domain-specific MRC problems such as biomedical-MRC, the source domain is usually a general-purpose domain where a large-scale human-annotated MRC dataset is available. The target domain, in this case, is the biomedical domain. In this work, we focus on transferring the knowledge from an MRC model trained on a labeled general-purpose-domain dataset to the biomedical domain, where only unlabeled contexts are available. Unlabeled contexts refer to contexts in the MRC dataset that have no question-answer pairs.
Often, directly transferring the knowledge representations (learned by an MRC model) from the source to the target domain can hurt the performance of the model because of the distributional discrepancies between the data seen at train and test time . Domain adaptation, a sub-setting of transfer learning , aims at mitigating these discrepancies through simultaneous generation of feature representations that are discriminative from the viewpoint of the MRC task in the source domain and indiscriminative from the perspective of the shift in the marginal distributions between the source and target domains .
We propose Adversarial learning-based Domain adAPTation framework for Biomedical Machine Reading Comprehension (BioADAPT-MRC), a new framework that uses adversarial learning to generate domain-invariant feature representations for better domain adaptation in biomedical-MRC models. In an adversarial learning framework, we train two adversaries sequentially or simultaneously against one another to generate domain-invariant features. Domain-invariant feature representations imply that the feature representations extracted from the source- and the target-domain samples are closer in the embedding space.
While other recent domain adaptation approaches for the MRC task focus on generating pseudo question-answer pairs to augment the training data [18, 52], we utilize only the unlabeled contexts from the target domain. This property makes our framework more suitable in cases where not only are human-annotated datasets scarce but the generation of synthetic question-answer pairs is also computationally expensive and requires further validation from domain experts (due to the sensitivity to the correctness of the domain knowledge). Since our goal is to generate domain-invariant features, we introduce a domain similarity discriminator in the adversarial learning framework. The discriminator identifies the relative similarity between the source and target domains and thus promotes the generation of domain-invariant representations accordingly.
We validate our proposed framework on three widely used benchmark datasets from the cornerstone challenge on biomedical question answering and semantic indexing, BioASQ 
, using their recommended evaluation metrics. We empirically demonstrate that with the presence of no labeled data from the biomedical domain – synthetic or human-annotated – our framework can achieve SOTA performance on these datasets. We further evaluate the domain adaptation capability of our framework by using clustering and dimensionality reduction techniques.
The primary contributions of the paper are as follows: (i) For the biomedical-MRC task, we propose a new adversarial domain adaptation framework with a domain similarity discriminator that aims at reducing the domain variance between the high-resource general-purpose domain and the biomedical domain. (ii) We leverage the unlabeled contexts from the biomedical domain and thus relax the need for synthetic or human-annotated labels (question-answer pairs) for target-domain data.
2 Background and related work
In this paper, we focus on the biomedical-MRC task using the adversarial learning-based domain adaptation technique. Thus, our work is in the confluence of two main research areas: biomedical machine reading comprehension and domain adaptation using adversarial learning.
2.1 Biomedical machine reading comprehension
In the biomedical machine reading comprehension (biomedical-MRC) task, the goal is to extract an answer span, given a user-defined question and a biomedical context. In neural network-based (NN-based) biomedical-MRC systems, the question-context pairs are converted from discrete textual form to continuous high-dimensional vector form using word-embedding algorithms, such as word2vec, GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT)
, etc. Among numerous architectural varieties of these NN-based MRC systems, transformer-based pre-trained language models (PLMs) such as BERT are the current SOTA. The original BERT model is pre-trained on general-purpose English corpora. Considering the semantic and syntactic uniqueness of biomedical text, researchers have developed different variants of BERT for the biomedical domain that are pre-trained on several biomedical corpora, such as PubMed abstracts, PMC full-text articles, and MIMIC datasets. Some examples of such PLMs are BioBERT, PubMedBERT, and BioELECTRA, which reportedly outperform the original BERT model in various biomedical NLP tasks. These PLMs are used as trainable encoding modules (encoders) in downstream biomedical NLP tasks, such as clinical-note classification [31], machine reading comprehension
, etc. Usually, to accomplish downstream tasks such as biomedical-MRC by transferring the knowledge from the PLMs, researchers add a few task-specific layers, commonly feed-forward neural network layers, at the end of the encoders [23, 29, 24, 2].
2.2 Transfer learning
Transfer learning is an approach to transfer knowledge representations acquired from a widely explored domain/task (source domain/task) to a new or less explored domain/task (target domain/task).
Adopting commonly used notation, a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$ consists of a feature space $\mathcal{X}$ (different from the feature representation learned by the network) and a marginal distribution $P(X)$ of the learning samples $X = \{x_1, \dots, x_n\} \in \mathcal{X}$. In NLP, the marginal distributions are different when the languages are the same but the topics differ between the source and target domains. Considering a label space $\mathcal{Y}$, for a given domain $\mathcal{D}$, a task can be described as $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, where $f(\cdot)$ is a predictor function learned from the training data.
Data scarcity in the target domain is often an impediment to the model's performance while training for a target task. Transfer learning tackles this issue by aiming to improve generalizability on a target task using knowledge acquired from a source domain $\mathcal{D}_S$ and a target domain $\mathcal{D}_T$, along with their respective associated tasks $\mathcal{T}_S$ and $\mathcal{T}_T$. For this work, we use transductive transfer learning
which uses labeled source domain, unlabeled target domain, identical source and target tasks, and different source and target domains. Depending on the similarity in the feature spaces, there are two cases of transductive transfer learning: i) different feature spaces for the source and target domains, e.g., cross-lingual transfer learning, ii) identical feature spaces, but different marginal probability distributions for the source-domain and target-domain samples, e.g., transfer learning between two domains with the same language but different topics. In this paper, we use the latter case of transductive transfer learning to train the biomedical-MRC system, otherwise known as domain adaptation .
2.2.1 Domain adaptation
Domain adaptation aims at increasing the generalizability of machine learning models when posed with unlabeled or very few labeled data from the target domain by generating domain-invariant representations. One can enforce the learning of domain-invariant features in machine learning models by implementing the adversarial learning framework [16, 50]. In the adversarial setting, usually, a domain discriminator is incorporated into the MRC framework where, besides performing the MRC task, the model attempts to fool the discriminator by generating domain-invariant features.
Prior works have used conditional probability, an IOB tagger, and Bi-LSTMs, or a seq2seq model with an attention mechanism, to generate pseudo question-answer pairs. A multi-task learning approach has also been used for domain adaptation in MRC tasks. Among these research works in MRC and domain adaptation, only a few focused on learning domain-invariant features in an adversarial setting.
3 Materials and methods
In this section, we discuss our adversarial learning-based domain adaptation framework for the biomedical-MRC task. Figure 1 shows our framework, which consists of three primary components: feature extractor, domain similarity discriminator, and MRC-module.
3.1 Problem definition
Given an unlabeled target domain $\mathcal{D}_T$ and a labeled source domain $\mathcal{D}_S$ along with their respective learning tasks $\mathcal{T}_T$ and $\mathcal{T}_S$, we assume that $\mathcal{T}_T = \mathcal{T}_S$ and $\mathcal{D}_T \neq \mathcal{D}_S$ because of $P(X_T) \neq P(X_S)$, where $P(\cdot)$ is the marginal probability distribution, and $X_T$ and $X_S$ are learning samples from the target and source domains, respectively. Thus, while the tasks are identical, the domains are different due to different marginal probability distributions in their data.
In this work, the target domain is the biomedical domain, where only unlabeled biomedical contexts are available, and the source domain is the general-purpose domain, where large-scale labeled data are available. As mentioned in section 2.2, despite having the same language, differences in the topics between two domains cause the domains to differ because of the dissimilarities in $P(X)$. In this work, we consider that the general-purpose and the biomedical domains have different topics. Thus, we assume that $P(X_T) \neq P(X_S)$.
The task for both domains is extractive MRC. Given a question $Q = \{q_1, \dots, q_m\}$ and a context $C = \{c_1, \dots, c_n\}$, extractive MRC predicts the start and end positions $a_{start}$ and $a_{end}$, respectively, of the answer span in $C$ such that there exists one and only one answer span consisting of continuous tokens in the context. Here, $q_i$ denotes the $i$-th token in the question, $c_j$ denotes the $j$-th token in the context, and $m$ and $n$ respectively denote the number of tokens in $Q$ and $C$.
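As a concrete illustration of span extraction, the following minimal Python sketch (not the paper's implementation; all names and scores are hypothetical) picks the highest-scoring valid answer span from per-token start and end scores:

```python
# Illustrative sketch of extractive-MRC span selection: choose the span
# (s, e) with s <= e that maximizes start_scores[s] + end_scores[e],
# subject to a maximum answer length.

def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start_scores[s] + end_scores[e]."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["BRCA1", "is", "a", "tumor", "suppressor", "gene"]
start = [0.1, 0.0, 0.0, 0.7, 0.1, 0.1]  # hypothetical per-token scores
end   = [0.0, 0.0, 0.0, 0.1, 0.6, 0.3]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # -> "tumor suppressor"
```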
Given the labeled and unlabeled inputs respectively from the source and target domains, our proposed framework BioADAPT-MRC aims at achieving the following two objectives: i) predicting the answer spans from the provided contexts, ii) addressing the discrepancies in the marginal distributions between the input in the source and target domains by generating domain-invariant features. Figure 1 demonstrates the three primary components of the BioADAPT-MRC framework:
Feature extractor accepts a text sequence and encodes it into a high-dimensional continuous vector representation.
MRC-module accepts the encoded representation from either the source domain (training time) or the target domain (test time), then predicts the start and end positions of the answer span in the context.
Domain similarity discriminator accepts the encoded representations from the source and target domains and learns to distinguish between them.
3.2.1 Feature extractor
Given an input sample $x$ from either domain, the feature extractor maps it to a common feature space $\mathcal{G}$:
Here, $g \in \mathcal{G}$ denotes the extracted feature for the input sample $x$ from either $\mathcal{D}_S$ or $\mathcal{D}_T$. We utilize the encoder of the PLM BioELECTRA as the feature extractor. We choose the BioELECTRA model for the following reason: while biomedical domain-specific BERT models such as BioBERT and PubMedBERT outperform the original BERT model in several biomedical NLP tasks, BioELECTRA has the best performance scores on the Biomedical Language Understanding and Reasoning Benchmark (BLURB), in comparison with SciBERT, BioBERT, and PubMedBERT.
As mentioned in section 2.1, the features in this task are high-dimensional word embeddings extracted from the question-context pairs. To generate these word embeddings, the BioELECTRA model utilizes the transformer-based architecture of one of the BERT variants, ELECTRA. The ELECTRA model has 12 layers, a hidden size of 768, a feed-forward network (FFN) inner hidden size of 3072, and 12 attention heads per layer. The pre-training corpora for BioELECTRA comprise 3.2 million PubMed Central full-text articles and 22 million PubMed abstracts, and the pre-training task is replaced-token detection. BioELECTRA has a vocabulary of size 30,522.
The maximum number of input tokens per question-context pair is 512, where the embedding dimension of each token is 768. For each pair, the final tokenized input to the BioELECTRA model is [CLS] $q_1 \dots q_m$ [SEP] $c_1 \dots c_n$ [SEP]. Here, $q_i$ and $c_j$ are respectively tokens from the question and the context; [CLS] is a special token that can be considered to hold an accumulated representation of the input sequence and as such can be used for classification/regression tasks; [SEP] is another special token that separates two consecutive sequences. Note that, since the samples in the target domain are unlabeled, in place of the question tokens we use a single placeholder special token to maintain consistency in the structure of the tokenized samples. The BioELECTRA model uses the WordPiece tokenization algorithm to generate trainable word embeddings.
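The input layout described above can be sketched as follows. This is an illustrative simplification: the placeholder token name `[PLACEHOLDER]` and the function name are hypothetical, and a real implementation would use the BioELECTRA WordPiece tokenizer rather than pre-split tokens.

```python
def build_input(question_tokens, context_tokens, max_len=512):
    """Assemble the [CLS] question [SEP] context [SEP] token layout."""
    # For unlabeled target-domain samples there is no question, so a single
    # placeholder token stands in to keep the structure consistent.
    q = question_tokens if question_tokens else ["[PLACEHOLDER]"]
    seq = ["[CLS]"] + q + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Truncate to the model's maximum sequence length (sketch only; a real
    # tokenizer truncates the context while preserving the final [SEP]).
    return seq[:max_len]

src = build_input(["what", "is", "brca1"], ["brca1", "is", "a", "gene"])
tgt = build_input([], ["brca1", "is", "a", "gene"])
print(src)
print(tgt)
```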
3.2.2 MRC-module
As the MRC-module, we add a simple fully-connected layer on top of the feature extractor and use the softmax activation function to generate probability distributions for the start and end token positions following Equation 2.
$$p_i^{start} = \frac{\exp(W_{start} h_i)}{\sum_{j=1}^{L} \exp(W_{start} h_j)}, \qquad p_i^{end} = \frac{\exp(W_{end} h_i)}{\sum_{j=1}^{L} \exp(W_{end} h_j)} \qquad (2)$$
Here, $p_i^{start}$ and $p_i^{end}$ are the probabilities of the $i$-th token being predicted as $a_{start}$ and $a_{end}$, respectively; $h_i$ is the hidden representation vector of the $i$-th token; $W_{start}$ and $W_{end}$ are two trainable weight matrices; and $L$ is the input sequence length.
We use the cross-entropy loss on the predicted answer positions as the objective function for the MRC-module. Since each answer-span prediction yields two predicted outputs, for the start and end positions, we average the total cross-entropy loss as shown in Equation 3:
$$\mathcal{L}_{MRC} = -\frac{1}{2}\left(\log p^{start}_{y^{start}} + \log p^{end}_{y^{end}}\right) \qquad (3)$$
Here, the golden answer's start and end token positions are represented by $y^{start}$ and $y^{end}$, respectively. During the test phase, the predicted answer span is selected based on the positions with the highest probabilities from $p^{start}$ and $p^{end}$.
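The start/end softmax and the averaged cross-entropy loss can be illustrated with a small numeric sketch (plain Python with hypothetical scores; a real implementation operates on batched tensors):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mrc_loss(start_scores, end_scores, y_start, y_end):
    """Average of the start- and end-position cross-entropy losses."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    return -(math.log(p_start[y_start]) + math.log(p_end[y_end])) / 2

# Hypothetical logits over a 3-token context; gold span is tokens 0..2.
loss = mrc_loss([2.0, 0.1, 0.1], [0.1, 0.1, 2.0], y_start=0, y_end=2)
print(round(loss, 4))  # low loss: the model favors the correct positions
```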
3.2.3 Domain similarity discriminator
The domain similarity discriminator addresses the domain variance between the two domains (caused by the discrepancies in the marginal probability distributions) as follows: in the adversarial setting, the discriminator learns to distinguish between the feature representations of the source-domain and target-domain samples generated by the feature extractor. It then penalizes the feature extractor for producing domain-variant feature representations and thus promotes the generation of domain-invariant features. The discriminator uses the cosine distance between the feature representations of the input samples to distinguish between the domains. We consider that two samples are closer in the embedding space, and thus have a greater chance of being in the same domain, if their feature representations have a smaller cosine distance between them, and vice versa.
The input of the discriminator is a triplet $(g_{s_1}, g_{s_2}, g_t)$, where $g_{s_1}$ and $g_{s_2}$ are the feature representations of two samples from the source domain and $g_t$ is that of a sample from the target domain, all extracted by the feature extractor. The triplet is then split into two distinct pairs, $(g_{s_1}, g_{s_2})$ and $(g_{s_1}, g_t)$. As indicated in Figure 1, upon receiving each triplet, the discriminator accomplishes two tasks: i) it measures the similarity between $(g_{s_1}, g_{s_2})$ and the dissimilarity between $(g_{s_1}, g_t)$; ii) it performs the MRC task, similar to the MRC-module, for the source samples.
The discriminator acts as a function that helps estimate the similarity and dissimilarity between the received pairs. Considering the success of transformer-encoder models in many NLP tasks, for the single-layer Siamese network we adopt the same architecture as an encoder layer of the feature extractor, which has 12 attention heads, an embedding dimension of 768, an FFN inner hidden size of 3072, a 10% dropout rate, and the Gaussian Error Linear Unit (GeLU) activation function.
We first encode the feature representations of the input pairs using the same Siamese encoder network. Considering the role of the special [CLS] token, as explained in section 3.2.1, in letting the discriminator differentiate whether the pairs are from the same domain or not, we extract the [CLS]-token representations from the encoder outputs for both pairs. We then use these [CLS]-token representations to calculate the domain similarity and dissimilarity via a triplet loss function, which we use as the learning objective of the discriminator, as shown in Equations 4 and 5.
$$\mathcal{L}_{triplet} = \frac{1}{B}\sum_{i=1}^{B} \max\left(0,\; d\left(z_{s_1}^{(i)}, z_{s_2}^{(i)}\right) - d\left(z_{s_1}^{(i)}, z_{t}^{(i)}\right) + \alpha\right) \qquad (4)$$
$$d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert} \qquad (5)$$
Here, $B$ is the batch size; $z_{s_1}^{(i)}$ and $z_{s_2}^{(i)}$ are the [CLS]-token representations of the two samples from the source domain and $z_{t}^{(i)}$ that of the sample from the target domain in the $i$-th triplet; $d$ is the cosine distance, where $\frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$ is the cosine similarity; and $\alpha$ is the non-negative margin representing the minimum difference between the two distances required for the triplet loss to be $0$. To optimize the discriminator, the triplet loss function aims at minimizing the cosine distance between the samples from the source domain and maximizing the cosine distance between the samples from the source and target domains. Using the triplet loss, our discriminator efficiently employs both the similar and dissimilar information extracted by the feature extractor component of the model.
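A minimal sketch of this triplet objective for a single triplet (plain Python with toy 2-dimensional vectors; the actual model uses 768-dimensional token representations, and the margin value here is hypothetical):

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def triplet_loss(z_s1, z_s2, z_t, margin=0.5):
    # Pull the two source representations together; push the
    # source-target pair apart by at least `margin`.
    return max(0.0, cosine_distance(z_s1, z_s2)
                    - cosine_distance(z_s1, z_t) + margin)

z_s1, z_s2 = [1.0, 0.0], [0.9, 0.1]   # similar source samples
z_t = [0.0, 1.0]                       # dissimilar target sample
print(triplet_loss(z_s1, z_s2, z_t))  # -> 0.0 (margin already satisfied)
```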
Using the concept of AC-GAN, for the second task of the discriminator, we introduce an MRC-module similar to the main MRC-module on top of the Siamese discriminator network as an auxiliary task layer. This layer enforces that the discriminator does not lose task-specific information while learning to encode domain-variant features. The input of this auxiliary MRC-module is the output of the Siamese encoder for the source samples, and the output is the probability distributions for the start and end token positions of the answer span, as in the main MRC-module. Thus, the loss function for the auxiliary module, $\mathcal{L}_{aux}$, is the same as $\mathcal{L}_{MRC}$. We denote the final loss function for the discriminator as $\mathcal{L}_{disc} = \mathcal{L}_{triplet} + \mathcal{L}_{aux}$.
3.2.4 Cost function
To eliminate domain shift and learn domain-invariant feature representations, we integrate the feature extractor, the MRC-module, and the discriminator into an adversarial learning framework, where we update the feature extractor and the MRC-module to maximize the discriminator loss and minimize the task loss, while updating the discriminator to minimize the discriminator loss. Thus, the cost function of the BioADAPT-MRC framework consists of the task loss $\mathcal{L}_{MRC}$ and the discriminator loss $\mathcal{L}_{disc}$ and is optimized end-to-end:
$$\mathcal{L} = \mathcal{L}_{MRC} + \lambda \mathcal{L}_{disc}$$
Here, $\lambda$ is a regularization parameter to balance $\mathcal{L}_{MRC}$ and $\mathcal{L}_{disc}$.
4 Results and discussion
We perform an extensive study to evaluate the proposed framework and compare with the SOTA biomedical-MRC methods on a collection of publicly available and widely used benchmark biomedical-MRC datasets.
To demonstrate the effectiveness of our framework, we evaluate BioADAPT-MRC and compare it with the SOTA methods on three biomedical-MRC datasets from the BioASQ annual challenge. The BioASQ competition has been organized since 2013 and consists of two large-scale biomedical NLP tasks: i) task A - semantic indexing and ii) task B - question answering. Task B involves information retrieval and question answering with four types of questions - Yes/No, Factoid, List, and Summary. Since task B with factoid questions resembles the extractive biomedical-MRC task, we utilize only the factoid MRC datasets from the BioASQ challenges held in 2019 (BioASQ-7b), 2020 (BioASQ-8b), and 2021 (BioASQ-9b) as the target-domain datasets to verify our model. All three datasets were created from the search engine for biomedical literature, PubMed, with the help of domain experts. Note that, for training, our framework requires only unlabeled contexts in the target domain. As such, we only consider the contexts in the BioASQ-7b, 8b, and 9b training sets and disregard the question-answer pairs.
We reinforce the reliability of the comparison of experimental results with other methods and validate the efficacy of the BioADAPT-MRC framework by using the pre-processed training sets for BioASQ-7b and 8b provided in prior work. For BioASQ-9b, we pre-process the training data by retrieving full abstracts from PubMed using their provided PMIDs. We use these retrieved abstracts as the contexts in the BioASQ-9b training set. Since our framework requires no label (question-answer pair) in the training set of the target domain, for all BioASQ training sets, we disregard each question by replacing it with a placeholder token and each answer by replacing it with an empty string. At test time, we use the golden enriched test sets - BioASQ-7b, 8b, and 9b - from the BioASQ challenges.
As the source-domain dataset, we use SQuAD, which was developed from Wikipedia articles by crowd-workers.
Table 1 shows the basic statistical information of all the datasets used in the experiments. As shown, the number of training data samples in the source domain is noticeably higher than that of the target domain.
|Dataset name||Training set (raw)||Training set (after preprocessing)||Target to source ratio in training set||Test set|
4.2 Experimental setup and training configurations
As described in section 3
, our BioADAPT-MRC framework consists of three main components: feature extractor, MRC-module, and discriminator. In the implementation of the framework with PyTorch, we initialize the feature extractor with the parameters from the pre-trained BioELECTRA model using the huggingface API 
. For the parameters in the MRC-module and the discriminator, we perform random initialization. For tokenization, we set the maximum query length to 64, the maximum answer length to 30, the maximum sequence length to 384, and the document stride to 128, as suggested in [42, 13, 29]
. We empirically determine that a learning rate of 5e-5 and a batch size of 32 are the best choices for our experiments. For each step in a training epoch, we randomly select two samples from the source-domain dataset, SQuAD, and one sample from the target-domain dataset, BioASQ-7b/8b/9b. After trial and error, we initially set the regularization parameter $\lambda$ to 0 and then increase it by 0.01 every 10 epochs, up to 0.04. We run all our experiments on a Linux server with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and a single Tesla V100-SXM2-16GB GPU.
For evaluation, we utilize three metrics used for the MRC task in the official BioASQ challenge: strict accuracy (SAcc), lenient accuracy (LAcc), and mean reciprocal rank (MRR). For this task, the BioASQ challenge requires the participant systems to predict the five best-matched answer spans, extracted verbatim from the context(s), in decreasing order of confidence score. In the golden test set, for each question, the biomedical experts in the BioASQ team provided one golden answer extracted from the context. Both golden answers and predicted answer spans are used to calculate the SAcc, LAcc, and MRR scores shown in Equation 8. SAcc shows the model's capability to find the exact answer location, LAcc determines the model's understanding of the predicted answer's range, and MRR reflects the quality of the predicted answer spans:
$$SAcc = \frac{c_1}{n}, \qquad LAcc = \frac{c_5}{n}, \qquad MRR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{r(i)} \qquad (8)$$
Here, $n$ is the number of questions in the test set, $c_1$ is the number of questions correctly answered by the predicted answer span with the highest confidence score, $c_5$ is the number of questions correctly answered by any of the five predicted answer spans, and $r(i)$ is the rank of the golden answer among all five predicted answer spans for the $i$-th question. If the golden answer does not belong to the five predicted answer spans, we consider $r(i) = \infty$, i.e., $\frac{1}{r(i)} = 0$. We implement these evaluation metrics by leveraging the publicly available tools provided by the official BioASQ GitHub page at https://github.com/BioASQ/Evaluation-Measures.
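The three metrics can be sketched as follows (illustrative Python with hypothetical answers; the official implementation is the BioASQ Evaluation-Measures tool):

```python
def bioasq_scores(golden, predictions):
    """Compute SAcc, LAcc, and MRR.

    golden: list of gold answers, one per question.
    predictions: list of 5-best predicted answers per question,
    in decreasing order of confidence.
    """
    n = len(golden)
    c1 = sum(1 for g, p in zip(golden, predictions) if p and p[0] == g)
    c5 = sum(1 for g, p in zip(golden, predictions) if g in p[:5])
    # Reciprocal rank is 0 when the gold answer is not in the top five.
    mrr = sum(1.0 / (p.index(g) + 1) if g in p[:5] else 0.0
              for g, p in zip(golden, predictions)) / n
    return c1 / n, c5 / n, mrr

golden = ["BRCA1", "TP53"]
preds = [["BRCA1", "TP53", "EGFR", "KRAS", "MYC"],   # gold at rank 1
         ["EGFR", "TP53", "KRAS", "MYC", "BRCA1"]]   # gold at rank 2
sacc, lacc, mrr = bioasq_scores(golden, preds)
print(sacc, lacc, mrr)  # 0.5 1.0 0.75
```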
4.4 Method comparison
We compare the test-time performance of BioADAPT-MRC on BioASQ-7b and 8b with the six best-performing models selected based on related published articles: Google, BioBERT, UNCC, Umass, KU-DMIS-2020, and BioQAExternalFeatures. All these models except Umass use labeled biomedical-MRC data in the inductive transfer learning setting with data from the general-purpose domain. Umass, on the other hand, uses synthetic question-answer pairs generated by using unsupervised cloze translation. The current SOTA method on BioASQ-7b and 8b, BioQAExternalFeatures, uses externally extracted syntactic and lexical features of the questions and contexts along with the labels, possibly exposing it to adversarial attacks that leverage syntactic and lexical knowledge of the dataset. For BioASQ-9b, we report the best SAcc, LAcc, and MRR scores from the top entries in the BioASQ-9b leaderboard, since no single model in the leaderboard performed best on all three metrics.
4.5 Experimental results
Table 2 shows the comparison of BioADAPT-MRC with the SOTA biomedical-MRC methods on BioASQ-7b, BioASQ-8b, BioASQ-9b. As shown, BioADAPT-MRC improves on both LAcc and MRR when tested on all three BioASQ test sets and achieves the SOTA performance.
|BioASQ-9b Challenge - Best scores **||–||–||–||–||–||–||0.5399||0.7300||0.6017|
We also notice that while our model achieves the SOTA SAcc score for BioASQ-9b, it achieves the second-best SAcc scores for BioASQ-7b and 8b. The higher SAcc and LAcc scores imply that our model is able to correctly extract complete answers from the given contexts more frequently than the previous methods. The higher MRR scores, on the other hand, reflect our model's ability to extract complete answers with higher probability than the previous methods. In contrast to the previous works, our method uses no target-domain label information (question-answer pairs) during training and has still been able to achieve good performance, implying the effectiveness of our proposed framework.
As explained in section 3, in the framework, we propose a domain similarity discriminator with an auxiliary task layer that aims at promoting the generation of domain-invariant features in the feature extractor and thus improving the performance of the model. To show the effectiveness of the discriminator and the auxiliary task layer, we perform an ablation study and report the experimental results in Table 3.
|+Discriminator without auxiliary task layer||0.4506||0.6235||0.5232||0.3643||0.6093||0.4673||0.5583||0.7362||0.6314|
|BioADAPT-MRC (Baseline+Full Discriminator)||0.4506||0.6420||0.5286||0.3841||0.6159||0.4844||0.5583||0.7423||0.6308|
For a fair comparison, we perform all experiments under the same hyper-parameter settings. The baseline model shown in Table 3 consists of only the feature extractor and the MRC-module and was trained on the labeled source domain dataset, SQuAD. For the remaining two models, we use the labeled SQuAD and the unlabeled BioASQ training datasets during training simultaneously. The addition of the discriminator enables the feature extractor in the baseline model to use the unlabeled BioASQ training datasets for generating domain-invariant feature representations. This is done by using the dissimilarity measurements between the feature representations of the SQuAD and BioASQ data. As shown, after adding only the discriminator without the auxiliary task layer, the performance of the model improves from the baseline, suggesting the influence of the discriminator. We explain this influence on the feature extractor more elaborately later in this section. For the final experiment in the ablation study (Table 3), we use our whole model consisting of the domain similarity discriminator with the auxiliary task layer and notice an even further performance improvement. The auxiliary task layer, in this study, constrains the changes in the task-relevant features in the domain similarity discriminator during training. Thus, the improvement in model performance after incorporating the auxiliary task layer suggests that with the task layer, the domain similarity discriminator can better promote the generation of domain-invariant features that are simultaneously discriminative from the viewpoint of the MRC task in the source domain.
4.5.1 Domain adaptation
We show the influence of the domain similarity discriminator by plotting (Figure 2) all samples from the BioASQ-9b test set and a set of random samples from the SQuAD training set.
We pick random samples from the SQuAD training set to match the number of samples in the BioASQ-9b test set. As explained in section 3.2.1, we use the feature representation of the [CLS] token as an accumulated representation of the whole input sequence. Each [CLS]-token feature representation has a dimension of 768. To reduce these dimensions to two for visualization, we use multidimensional scaling (MDS). We use MDS because it reduces the dimensions by preserving the dissimilarities between two data points in the original high-dimensional space. Since we use cosine distance in the discriminator to measure the dissimilarity between two domains, as the dissimilarity measure in MDS, we use the pairwise cosine distance.
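This visualization step can be sketched with scikit-learn's MDS on precomputed pairwise cosine distances (illustrative only; random vectors stand in for the actual 768-dimensional feature representations):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 768))  # stand-in for [CLS] features

# Precompute pairwise cosine distances, then embed them in 2-D with MDS,
# which tries to preserve the given dissimilarities.
dist = cosine_distances(feats)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)
print(coords.shape)  # (40, 2)
```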
The feature representations of the [CLS] token on the left plot and the right plot in Figure 2 are generated by the feature extractors from the baseline model and the BioADAPT-MRC model, respectively. For a fair comparison, the selection of random SQuAD training samples is the same for the baseline and BioADAPT-MRC models. As shown, the features generated by the baseline model form two separate clusters for SQuAD and BioASQ-9b. The features generated by BioADAPT-MRC, on the other hand, form two overlapping clusters, implying reduced dissimilarity between the source and target domains. Interestingly, we notice that the data points from BioASQ are closer to their own cluster than those from SQuAD. This may be because, unlike SQuAD, the data in BioASQ originate from a single domain, and thus the feature representations are more similar to one another.
Table 4: Mean accuracy (standard deviation) and mean silhouette score (standard deviation).
To further analyze the quality of the clusters before and after introducing the domain similarity discriminator to the framework, and thus to quantify the effect of domain adaptation, we perform DBSCAN clustering . We cluster the MDS components of the [CLS]-token features for the samples in the BioASQ test sets and for the random samples from the SQuAD training set. To account for the bias of random sampling, for each BioASQ test set we select five sets of random samples from the SQuAD training set and report the mean accuracy and silhouette scores with standard deviations in Table 4 and Figures 3 and 4.
We use DBSCAN clustering because it views clusters as high-density regions in which the distance between samples is measured by a chosen distance metric, providing flexibility in the shapes and number of clusters. As the distance measure for DBSCAN, we choose pairwise cosine distance. We implement the DBSCAN algorithm using the scikit-learn tool and tune its two essential hyperparameters: eps and min_samples. The details of these two hyperparameters can be found in the scikit-learn documentation. After optimization, we select 0.01, 0.05, and 0.005 as the eps for BioASQ-7b, 8b, and 9b, respectively, and 20 as the min_samples for all three test sets. Note that, for a fair comparison, for each test set the choice of hyperparameters is the same for the baseline and BioADAPT-MRC models.
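The clustering step can be sketched as follows. The 2-D points, eps, and min_samples below are illustrative toy choices, not the tuned values or actual MDS components from the study:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D points standing in for the MDS components of the features.
rng = np.random.default_rng(1)
cluster_a = rng.normal([1.0, 0.0], 0.01, (30, 2))  # "source-like" group
cluster_b = rng.normal([0.0, 1.0], 0.01, (30, 2))  # "target-like" group
points = np.vstack([cluster_a, cluster_b])

# DBSCAN with pairwise cosine distance as the metric: points with at
# least min_samples neighbors within eps become high-density cores.
labels = DBSCAN(eps=0.05, min_samples=20, metric="cosine").fit_predict(points)
print(np.unique(labels))  # two tight, well-separated groups -> two clusters
```

Overlapping groups (as produced by BioADAPT-MRC's features) would instead merge into a single cluster under the same settings.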
Table 4 and Figure 3 show that DBSCAN can identify two clusters with high accuracy when the features of the samples are extracted from the baseline model. The accuracy drops when the features of the same samples are extracted from the BioADAPT-MRC model, as they form a single cluster. Moreover, we analyze the silhouette scores to understand the separation distance between clusters. The silhouette score ranges over [-1, 1]. A score of 1 indicates that the clusters are highly dense and clearly distinguishable from each other, whereas -1 indicates incorrect clustering; a score at or near zero indicates indistinguishable or overlapping clusters. As shown in Table 4 and Figure 4, the high silhouette scores (closer to 1) for the baseline model reflect that the feature representations of the samples from each domain are much more similar to their own cluster than to the other one. On the contrary, the low silhouette scores (closer to zero) for the BioADAPT-MRC model indicate that the feature representations of the samples from both domains are very similar to one another. These results demonstrate the effectiveness of the domain similarity discriminator in the BioADAPT-MRC framework.
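The contrast between the two silhouette regimes can be illustrated on synthetic data (Euclidean distances and toy clusters, purely for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
labels = np.array([0] * 40 + [1] * 40)

# Well-separated clusters: the analogue of the baseline model's features.
apart = np.vstack([rng.normal(0.0, 0.2, (40, 2)),
                   rng.normal(5.0, 0.2, (40, 2))])

# Heavily overlapping clusters: the analogue of BioADAPT-MRC's features.
mixed = np.vstack([rng.normal(0.0, 1.0, (40, 2)),
                   rng.normal(0.3, 1.0, (40, 2))])

# Silhouette scores lie in [-1, 1]: high for dense, well-separated
# clusters and near zero for overlapping ones.
print(round(float(silhouette_score(apart, labels)), 2))  # close to 1
print(round(float(silhouette_score(mixed, labels)), 2))  # near 0
```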
4.5.2 Motivating example
Considering the variability of the predicted answers in an MRC task, we present a motivating example to demonstrate how word importance may impact answer predictions, and thus the performance, of the biomedical-MRC task (Figure 5).
This example is selected randomly from the BioASQ-9b test set, from among the samples predicted incorrectly by the baseline model and correctly by the BioADAPT-MRC model. The question is separated from the context by the [SEP] token. The predicted start and end tokens are highlighted in yellow. Positive and negative word importance is highlighted in green and red, respectively; the density of the color reflects the magnitude of the importance, with higher density indicating higher importance. We calculate word importance using the Integrated Gradients algorithm  implemented in the Captum tool . Integrated Gradients measures the importance of a word (in predicting the answer span) by computing the gradient of the predicted output with respect to that input word .
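The core of Integrated Gradients can be sketched independently of Captum as a Riemann-sum approximation of the path integral; the function and inputs below are hypothetical stand-ins, not an MRC model:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of Integrated Gradients:
    IG_i(x) = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a*(x - x')) da."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas],
                    axis=0)
    return (x - baseline) * grads

# Toy differentiable stand-in for a model output: F(x) = x0^2 + 3*x1.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: np.array([2.0 * x[0], 3.0])

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attributions = integrated_gradients(grad_f, x, baseline)

# Completeness axiom: attributions sum to F(x) - F(baseline) = 7.
print(attributions)        # approximately [4. 3.]
print(attributions.sum())  # approximately 7.0
```

In the paper's setting, F is the predicted start- (or end-) position score, x is the embedded input sequence, and the baseline is typically an all-padding input; Captum handles the gradient computation through the model.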
In this particular example, since the start and end tokens are the same, we expect a correct MRC model to assign similar word importance for the start- and end-position predictions. While this holds for the BioADAPT-MRC model (right plot), for the baseline model (left plot) we notice different word-importance patterns for the start- and end-position predictions, resulting in an incorrect answer-span prediction. This illustrates the advantage of our model over the baseline for this example.
Biomedical machine reading comprehension is a crucial and emerging natural language processing task in the biomedical domain. Biomedical-MRC aims at comprehending complex biomedical contexts and helping medical professionals extract information from them. Most MRC methods rely on a high volume of human-annotated data to reach near human-level performance. However, acquiring a labeled MRC dataset in the biomedical domain is expensive in terms of domain expertise, time, and effort, creating the need for transfer learning from a source domain to a target domain. Due to the variance between the two domains, directly transferring an MRC model to the target domain often degrades its performance. We propose a new framework for biomedical machine reading comprehension, BioADAPT-MRC, which addresses the issue of domain variance by using a domain adaptation technique in an adversarial learning setting. We use a labeled MRC dataset from a general-purpose domain (source domain) along with unlabeled contexts from the biomedical domain (target domain) as our training data. We introduce a domain similarity discriminator that aims to reduce the domain variance between the general-purpose and biomedical domains and thereby boost the performance of the biomedical-MRC model. We validate our proposed framework on three widely used benchmark datasets from the biomedical question answering and semantic indexing challenge, BioASQ. We comprehensively demonstrate that, without any label information in the target domain during training, the BioADAPT-MRC framework can achieve SOTA performance on these datasets. We perform an extensive quantitative study of the domain adaptation capability using dimensionality reduction and clustering techniques and show that our framework can learn domain-invariant feature representations.
We conclude that BioADAPT-MRC may be beneficial in healthcare systems as a tool to efficiently retrieve information from complex narratives and thus save healthcare professionals valuable time and effort. For future work, we would like to pursue the following directions: i) applying our framework to other NLP applications in the biomedical domain that suffer from labeled-data scarcity, such as biomedical named entity recognition and clinical negation detection; ii) analyzing the robustness of the domain-invariant feature representations learned by the BioADAPT-MRC model against meticulously crafted adversarial attacks that may leverage syntactic and lexical knowledge bases derived from the dataset.
Maria Mahbub: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Sudarshan Srinivasan: Writing – review & editing. Edmon Begoli: Supervision, Writing – review & editing. Gregory D. Peterson: Supervision, Writing – review & editing.
-  Bhavani Singh Agnikula Kshatriya, Elham Sagheb, Chung-Il Wi, Jungwon Yoon, Hee Yun Seol, Young Juhn, and Sunghwan Sohn. Identification of asthma control factor in clinical notes using a hybrid deep learning model. BMC medical informatics and decision making, 21(7):1–10, 2021.
-  Sultan Alrowili and K Vijay-Shanker. Biom-transformers: building large biomedical language models with bert, albert and electra. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 221–227, 2021.
-  Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical bert embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78. Association for Computational Linguistics, 2019.
-  Samar Bashath, Nadeesha Perera, Shailesh Tripathi, Kalifa Manjang, Matthias Dehmer, and Frank Emmert Streib. A data-centric review of deep transfer learning with applications to text data. Information Sciences, 585:498–528, 2022.
-  Hilda Bastian, Paul Glasziou, and Iain Chalmers. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS medicine, 7(9):e1000326, 2010.
-  Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China, November 2019. Association for Computational Linguistics.
-  Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
-  Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems, 6, 1993.
-  Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics.
-  Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics, 6:557–570, 2018.
-  Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
-  Guilherme Del Fiol, T Elizabeth Workman, and Paul N Gorman. Clinical questions raised by clinicians at the point of care: a systematic review. JAMA internal medicine, 174(5):710–718, 2014.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
-  Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
-  Susannah Fox and Maeve Duggan. Health online 2013. Health, 2013:1–55, 2013.
-  Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
-  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
-  David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. Two-stage synthesis networks for transfer learning in machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 835–844, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
-  Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
-  Dayan Guan, Jiaxing Huang, Shijian Lu, and Aoran Xiao. Scale variance minimization for unsupervised domain adaptation in image segmentation. Pattern Recognition, 112:107764, 2021.
-  Philip N Hider, Gemma Griffin, Marg Walker, and Edward Coughlan. The information-seeking behavior of clinical staff in a large health care organization. Journal of the Medical Library Association: JMLA, 97(1):47, 2009.
-  Stefan Hosein, Daniel Andor, and Ryan McDonald. Measuring domain portability and error propagation in biomedical qa. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 686–694. Springer, 2019.
-  Minbyul Jeong, Mujeen Sung, Gangwoo Kim, Donghyeon Kim, Wonjin Yoon, Jaehyo Yoo, and Jaewoo Kang. Transferability of natural language inference to biomedical question answering. Working notes of CLEF 2020 conference and labs of the evaluation forum, 2020.
-  Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics.
-  Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for pytorch. arXiv preprint arXiv:2009.07896, 2020.
-  Vaishnavi Kommaraju, Karthick Gunasekaran, Kun Li, Trapit Bansal, Andrew McCallum, Ivana Williams, and Ana-Maria Istrate. Unsupervised pre-training for biomedical question answering. In CLEF (Working Notes), 2020.
-  Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
-  Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
-  Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
-  Usman Naseem, Matloob Khushi, Vinay Reddy, Sakthivel Rajendran, Imran Razzak, and Jinman Kim. Bioalbert: A simple and effective pre-trained language model for biomedical named entity recognition. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2021.
-  Anastasios Nentidis, Georgios Katsimpras, Eirini Vandorou, Anastasia Krithara, Luis Gasco, Martin Krallinger, and Georgios Paliouras. Overview of bioasq 2021: The ninth bioasq challenge on large-scale biomedical semantic indexing and question answering. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 239–263. Springer, 2021.
-  Kosuke Nishida, Kyosuke Nishida, Itsumi Saito, Hisako Asano, and Junji Tomita. Unsupervised domain adaptation of language models for reading comprehension. In LREC, 2020.
-  Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642–2651. PMLR, 2017.
-  Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
-  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.
-  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12:2825–2830, 2011.
-  Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
-  Gabriele Pergola, Elena Kochkina, Lin Gui, Maria Liakata, and Yulan He. Boosting low-resource biomedical QA via entity-aware masking strategies. In Paola Merlo, Jörg Tiedemann, and Reut Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pages 1977–1985. Association for Computational Linguistics, 2021.
-  Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
-  Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In ACL/IJCNLP (1), pages 443–453, 2021.
-  Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu. Bioelectra: Pretrained biomedical text encoder using discriminators. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 143–154, 2021.
-  Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics.
-  Tony Russell-Rose and Jon Chamberlain. Expert search strategies: the information retrieval practices of healthcare information professionals. JMIR medical informatics, 5(4):e7680, 2017.
-  Sining Sun, Binbin Zhang, Lei Xie, and Yanning Zhang. An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing, 257:79–87, 2017.
-  Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 3319–3328. JMLR.org, 2017.
-  Sai Krishna Telukuntla, Aditya Kapri, and Wlodek Zadrozny. Uncc biomedical semantic question answering systems. bioasq: Task-7b, phase-b. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 695–710. Springer, 2019.
-  Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn. Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2062–2068, 2019.
-  George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):1–28, 2015.
-  Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
-  Thuy-Trang Vu, Dinh Phung, and Gholamreza Haffari. Effective unsupervised domain adaptation with adversarially trained language models. pages 6163–6173, 01 2020.
-  Huazheng Wang, Zhe Gan, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, and Hongning Wang. Adversarial domain adaptation for machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2510–2520, Hong Kong, China, November 2019. Association for Computational Linguistics.
-  Kilian Q Weinberger, John Blitzer, and Lawrence Saul. Distance metric learning for large margin nearest neighbor classification. Advances in neural information processing systems, 18, 2005.
-  Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
-  Gezheng Xu, Wenge Rong, Yanmeng Wang, Yuanxin Ouyang, and Zhang Xiong. External features enriched model for biomedical question answering. BMC bioinformatics, 22(1):1–19, 2021.
-  Wonjin Yoon, Jinhyuk Lee, Donghyeon Kim, Minbyul Jeong, and Jaewoo Kang. Pre-trained language model for biomedical question answering. In Peggy Cellier and Kurt Driessens, editors, Machine Learning and Knowledge Discovery in Databases - International Workshops of ECML PKDD 2019, Proceedings, pages 727–740. Springer, 2020.
-  Xiang Yue, Xinliang Frederick Zhang, Ziyu Yao, Simon M. Lin, and Huan Sun. Cliniqg4qa: Generating diverse questions for domain adaptation of clinical question answering. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 580–587, 2021.