Transferability of Natural Language Inference to Biomedical Question Answering

07/01/2020 ∙ by Minbyul Jeong, et al. ∙ Korea University

Biomedical question answering (QA) is a challenging problem due to the scarcity of data and the requirement of domain expertise. Growing interest in using pre-trained language models with transfer learning addresses the issue to some extent. Recently, learning the linguistic knowledge of entailment in sentence pairs has been shown to enhance performance in general-domain QA by leveraging the transferability between the two tasks. In this paper, we focus on facilitating this transferability by unifying the experimental setup from natural language inference (NLI) to biomedical QA. We observe that transferring from entailment data shows effective performance on Yes/No (+5.59%), Factoid (+0.53%) and List (+13.58%) type questions compared to the previous BioASQ Challenge Task 7B (Phase B) results. We also observe that our method generally performs well in the 8th BioASQ Challenge (Phase B). For sequential transfer learning, the order in which tasks are fine-tuned is important. In factoid- and list-type questions, we thoroughly analyze an intrinsic limitation of the extractive QA setting when these questions are converted to the same format as the Stanford Question Answering Dataset (SQuAD).


1 Introduction

Biomedical question answering (QA) is a challenging problem due to the limited amount of data and the requirement of domain expertise. Recent successes of transfer learning [13, 28] address these issues by using pre-trained language models [6, 22] and further fine-tuning them on a target task [8, 14, 23, 29, 34, 36]. In spite of performance gains from transfer learning, results are still short of the upper bound in biomedical QA. Sequential transfer learning has been introduced as an improvement of transfer learning in order to push performance closer to this upper bound [14, 34, 36]. For example, fine-tuning from the large-scale SQuAD dataset [25] to the much smaller BioASQ dataset [31] yields better performance than using the BioASQ dataset alone. In the general domain, training on the linguistic knowledge of entailment between sentence pairs is effective when deployed as the first step in a sequential transfer learning pipeline [4, 23, 24, 32, 33]. Thus, in this paper, we exploit the task of NLI [3, 35] to enhance the performance of biomedical QA. We find that performance improves when the objective function of the fine-tuned task becomes similar to that of the downstream task. We also find that adapting NLI to biomedical QA confronts the obstacle of task discrepancy, which refers to the several differences between fine-tuned tasks such as the distribution of context lengths, the objective function, and domain shift.

Specifically, between NLI and biomedical QA, we focus on reducing the discrepancy of context length to boost the performance of sequential transfer learning. In order to resolve this discrepancy, we unify the distribution of context length among the fine-tuned tasks: we reorganize each SQuAD context into a single sentence containing the ground truth answer span [18]. Fine-tuning on a unified distribution achieves speed improvements of 52.95% in training and 25% in inference on BioASQ with comparable results. Finally, we introduce an intrinsic limitation of the extractive QA setting regarding answerability when the BioASQ dataset is converted to the same format as the SQuAD dataset.

Our contributions are as follows:

  1. Leveraging an NLI dataset as a fine-tuning procedure is beneficial for the Yes/No, Factoid and List type questions in BioASQ.

  2. We demonstrate that a simple variation in the experimental setup can aid the transferability of NLI to biomedical QA.

  3. In Factoid and List type questions, we introduce an intrinsic limitation of the extractive QA setting, when the BioASQ data is converted to the SQuAD format.

2 Related Works

2.0.1 Transfer Learning

Transfer learning, also known as domain adaptation, refers to the situation in which knowledge learned in a previous task improves learning in a following task. In various fields including image processing and natural language processing (NLP), many studies have shown the effectiveness of transfer learning based on deep neural networks [9, 17, 19, 28, 37]. More recently, especially in NLP, pre-trained language models such as ELMo [22] and BERT [6] lead most NLP problems concerning transfer learning [4, 6, 12, 13, 16, 21, 23]. In specific domains, unsupervised pre-training has also been introduced for biomedical contextualized representations [2, 10, 14]. Among them, BioBERT [14] is initialized from BERT and further pre-trained on biomedical corpora (e.g., PubMed and PubMed Central), and various tasks in the biomedical or clinical domain exploit BioBERT [1, 2, 10, 11, 21, 36].

2.0.2 Transferability of Natural Language Understanding

From the perspective of QA, the authors of [34] transferred knowledge from the large open-domain SQuAD dataset to the target BioASQ dataset in order to handle the issue of data scarcity. In [14, 36], the authors adopted sequential transfer learning (e.g., BioBERT-SQuAD-BioASQ) to boost the performance of biomedical QA. Meanwhile, from the NLI point of view, multiple datasets for the general domain have emerged [3, 15, 25, 33, 35], and recently domain-specific (e.g., biomedical) datasets have also appeared [21, 27]. In [23], the authors show that intermediate fine-tuning on MultiNLI [35] consistently enhances the performance of target tasks across the GLUE benchmark [33]. Our work is more closely related to [5], where the authors suggest that transfer learning from NLI to diverse yes/no type QA tasks improves performance in the general domain. Furthermore, the authors of [32] extensively experiment with combinations of question answering, text classification/regression, and sequence labeling under data size constraints. In this paper, we further exploit the linguistic knowledge of the MultiNLI (MNLI) dataset to improve the performance of biomedical QA.

3 Methods

In this section, we outline our problem setup for the downstream task; our training details are described in Appendix 0.A. We formally explain our framework for learning biomedical entity representations with BioBERT, and then describe how sequential fine-tuning proceeds for each biomedical question type of the BioASQ Challenge. The intention of our method is to facilitate the transferability of NLI to BioASQ.

3.1 Problem Setup

We convert the BioASQ dataset to the SQuAD dataset format. In detail, instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C), also called snippets. Although the answer span is not given, for factoid and list types we first find the exact spans in the contexts based on the human-annotated answers. In this case, we enumerate all combinations of Q-C-A triplets only when the answer exactly matches a span in the context. For the Yes/No type, yes and no answers cannot exactly match the context, so we fine-tune a task-specific binary classifier to predict the answer.
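This conversion can be sketched as follows. The snippet below is a minimal illustration of the enumeration described above, not the released preprocessing code; the function name and the SQuAD-style field names are ours.

```python
# Minimal sketch (not the authors' released preprocessing script): keep a
# question-snippet pair only when a ground-truth answer string occurs as an
# exact span of the snippet, and record its character offset.
def bioasq_to_squad(question_id, question, snippets, answers):
    examples = []
    for snippet in snippets:
        for answer in answers:
            start = snippet.find(answer)        # exact string match only
            if start == -1:
                continue                        # no answerable span in this snippet
            examples.append({
                "id": f"{question_id}_{len(examples)}",
                "question": question,
                "context": snippet,
                "answers": {"text": [answer], "answer_start": [start]},
            })
    return examples
```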

3.2 Overall Architecture

The input sequence $X$ consists of the concatenation of the BERT [CLS] token, the question $Q$, and the context $C$, with [SEP] tokens in between. This is denoted as $X = [\text{CLS}] \oplus Q \oplus [\text{SEP}] \oplus C \oplus [\text{SEP}]$, where $\oplus$ refers to the concatenation of tensors. The hidden representation vector for the $i$-th input token is denoted as $h_i \in \mathbb{R}^{H}$ with hidden size $H$. Finally, we fine-tune the hidden vectors, which are fed into a softmax or binary classifier corresponding to each question type.
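For concreteness, the input packing can be sketched with the Hugging Face tokenizer; the checkpoint name and example strings below are illustrative and are not taken from our training pipeline.

```python
# Minimal sketch of building the [CLS] Q [SEP] C [SEP] input with the
# Hugging Face tokenizer; the BioBERT checkpoint name is used for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoding = tokenizer(
    "What causes Bathing suit ichthyosis?",   # question Q
    "BSI is caused by TGM1 mutations.",       # context C (snippet)
    max_length=384,
    truncation="only_second",                  # truncate the context, not the question
    return_tensors="pt",
)
# encoding["input_ids"] now holds the [CLS] Q [SEP] C [SEP] token ids.
```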

3.2.1 Yes/No Type

We compute the yes probability $p_{\text{yes}}$ by applying a linear transformation matrix $W \in \mathbb{R}^{1 \times H}$ to the hidden representation of the [CLS] token, $h_{[\text{CLS}]}$. The sigmoid function $\sigma$ is used for binary classification. The yes probability is calculated as follows:

$$p_{\text{yes}} = \sigma\left(W h_{[\text{CLS}]}\right) \tag{1}$$

The binary cross-entropy loss is utilized between the yes probability and its corresponding ground truth answer $y \in \{0, 1\}$. Our total loss is computed as below:

$$\mathcal{L}_{\text{yes/no}} = -\left[\, y \log p_{\text{yes}} + (1 - y) \log\left(1 - p_{\text{yes}}\right) \,\right] \tag{2}$$
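A minimal PyTorch sketch of this head, with our own class and variable names rather than the released code, is shown below.

```python
# Sketch of the Yes/No head: one linear layer over the [CLS] representation
# followed by a sigmoid, trained with binary cross-entropy (Eqs. (1)-(2)).
import torch
import torch.nn as nn

class YesNoHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)   # W in Eq. (1)
        self.loss_fn = nn.BCEWithLogitsLoss()         # sigmoid + BCE in one call

    def forward(self, cls_hidden, labels=None):
        logits = self.classifier(cls_hidden).squeeze(-1)          # shape: (batch,)
        p_yes = torch.sigmoid(logits)                             # yes probability
        loss = self.loss_fn(logits, labels.float()) if labels is not None else None
        return p_yes, loss
```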

3.2.2 Factoid & List type

We compute the start and end scores through one linear transformation matrix $W \in \mathbb{R}^{2 \times H}$ applied to the hidden representation vectors, whose rows $w_{\text{start}}$ and $w_{\text{end}}$ score the start and end positions. The probability of the $i$-th token being the start (end) of the answer span is calculated as follows:

$$p_i^{\text{start}} = \frac{\exp\left(w_{\text{start}} \cdot h_i\right)}{\sum_{j=1}^{s} \exp\left(w_{\text{start}} \cdot h_j\right)}, \qquad p_i^{\text{end}} = \frac{\exp\left(w_{\text{end}} \cdot h_i\right)}{\sum_{j=1}^{s} \exp\left(w_{\text{end}} \cdot h_j\right)} \tag{3}$$

where $s$ denotes the sequence length of BioBERT and $\cdot$ refers to the dot product. Our objective function is the negative log-likelihood of the predicted answer with respect to the ground truth answer position. The start and end position losses are computed as below:

$$\mathcal{L}_{\text{start}} = -\frac{1}{N} \sum_{n=1}^{N} \log p^{\text{start}}_{y_n^{\text{start}}}, \qquad \mathcal{L}_{\text{end}} = -\frac{1}{N} \sum_{n=1}^{N} \log p^{\text{end}}_{y_n^{\text{end}}} \tag{4}$$

where $N$ denotes the batch size, and $y_n^{\text{start}}$ and $y_n^{\text{end}}$ are the ground truth start and end positions of the $n$-th instance, respectively. Our total loss is the arithmetic mean of $\mathcal{L}_{\text{start}}$ and $\mathcal{L}_{\text{end}}$.
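Again as a sketch rather than the released implementation, the span head and its loss can be written as follows; class and variable names are illustrative.

```python
# Sketch of the factoid/list span head: one linear layer produces start and
# end logits per token, and the loss is the mean of the start- and
# end-position cross-entropies (Eqs. (3)-(4)).
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)   # rows w_start and w_end
        self.ce = nn.CrossEntropyLoss()

    def forward(self, sequence_hidden, start_positions=None, end_positions=None):
        # sequence_hidden: (batch, seq_len, hidden_size)
        start_logits, end_logits = self.qa_outputs(sequence_hidden).split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)       # (batch, seq_len)
        end_logits = end_logits.squeeze(-1)
        loss = None
        if start_positions is not None and end_positions is not None:
            loss = (self.ce(start_logits, start_positions)
                    + self.ce(end_logits, end_positions)) / 2.0
        return start_logits, end_logits, loss
```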

3.3 Transferability through domains and tasks

3.3.1 Yes/No Type

Learning to classify entailment can enhance a model’s ability on yes or no type QA in the general domain [5]. Following this finding, we think that this classification ability could be extended to the yes and no type of biomedical QA. Thus, we adopt the NLI task to solve biomedical yes or no type QA. We leverage the MNLI dataset because it is widely used and has enough data covering multiple genres. For our learning sequence, we fine-tune BioBERT on the MNLI dataset to learn the linguistic knowledge of entailment between hypothesis and premise sentence pairs, composing a sequence of transfer learning as BioBERT-MNLI-BioASQ. However, replacing the binary classifier with the final layer of the MNLI task shows no improvement in BioASQ performance. For this reason, we add a simple binary classifier on top of BioBERT to be fine-tuned. Furthermore, the distributions of context length in MNLI and in the Yes/No snippets of BioASQ are similar; therefore, we skip unifying the context length distribution for the Yes/No type.
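A sketch of this setup is shown below; the checkpoint path is hypothetical and stands for a BioBERT encoder already fine-tuned on MNLI, whose 3-way classification head is discarded in favor of a fresh binary classifier for BioASQ.

```python
# Sketch of the BioBERT-MNLI-BioASQ sequence for Yes/No questions (assumed
# workflow, not the authors' exact script): reuse the MNLI-fine-tuned encoder
# weights and attach a new binary classifier for BioASQ.
from transformers import AutoModel
import torch.nn as nn

encoder = AutoModel.from_pretrained("path/to/biobert-finetuned-on-mnli")  # hypothetical path
yesno_classifier = nn.Linear(encoder.config.hidden_size, 1)               # fresh head for BioASQ
# ... fine-tune encoder + yesno_classifier jointly on BioASQ Yes/No instances ...
```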

3.3.2 Factoid & List Type

To bridge the gap between different tasks, the order of sequential transfer learning is important. We find that performance gains appear when the objective function of the fine-tuned task becomes similar to that of the downstream task (Table 5). Thus, we build our base learning sequence as BioBERT-MNLI-SQuAD-BioASQ rather than switching the order of the intermediate tasks (e.g., BioBERT-SQuAD-MNLI-BioASQ). In order to resolve the discrepancy of context length, we make a small modification to the original experimental setting. As suggested in [18], we reorganize the distribution of context length in the SQuAD dataset to be similar to the MNLI contexts and the BioASQ snippets. We aim to develop an extractive QA setup that scales down to minimal context rather than using irrelevant sentences in the full abstract [36]. Therefore, we extract the sentence containing the ground truth answer span and use it as the whole paragraph to construct a minimal context. As a result, we reduce the difference by unifying the distribution of context length in our sequential transfer learning. Due to the converted distribution of context length, we achieve speed improvements in training and inference for factoid and list type questions while achieving comparable results.
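A minimal sketch of this minimal-context extraction, using a naive period-based sentence splitter for illustration only, could look like this.

```python
# Sketch: keep only the SQuAD sentence that contains the ground truth answer
# span and recompute the answer offset relative to that sentence.
def minimal_context(paragraph: str, answer_start: int, answer_text: str):
    offset = 0
    for sentence in paragraph.split(". "):        # naive sentence splitting
        end = offset + len(sentence) + 2          # account for the ". " separator
        if offset <= answer_start < end and answer_text in sentence:
            return sentence, answer_start - offset
        offset = end
    return None, None                              # answer span not recoverable
```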

4 Experiments

4.1 Datasets

Our datasets are based on the pre-processed versions provided by [25, 35, 36]. For the extractive QA setting, we convert all types of the BioASQ dataset (i.e., Yes/No, Factoid and List) into the same format as the SQuAD dataset. In [36], the authors suggested three pre-processing strategies, and we utilize two of them: Snippet-as-is and Full-Abstract. We modify the previous data with a new criterion of requiring white space before and after each biomedical entity, which has been shown to help distinguish biomedical named entities. The statistics of the revised dataset are listed in Table 8. We have made the modified version of the BioASQ dataset publicly available at https://github.com/dmis-lab/bioasq8b. For the unified experimental setting, we remove approximately 5K training instances from the SQuAD dataset due to missing string matches between the context and the answer spans.
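Our reading of this white-space criterion can be sketched as follows; the helper below is illustrative and not part of the released code.

```python
# Sketch of a whitespace-boundary match: an answer counts as a valid span
# only if it is bounded by whitespace (or the context edge) on both sides,
# which avoids matching inside a longer biomedical entity name.
import re

def find_whitespace_bounded_span(context: str, answer: str) -> int:
    pattern = r"(?:(?<=\s)|^)" + re.escape(answer) + r"(?=\s|$)"
    match = re.search(pattern, context)
    return match.start() if match else -1   # -1 means no valid span
```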

Reference System             | Yes/No (Macro F1) | Factoid (MRR) | List (F1)
Dimitriadis & Tsoumakas [7]  | 0.5541            | -             | -
Hosein et al. [8]            | -                 | 0.4562        | -
Oita et al. [20]             | 0.4831            | -             | -
Resta et al. [26]            | 0.7873            | -             | -
Telukuntla et al. [30]       | 0.4486            | 0.4751        | 0.2002
Yoon et al. [36]             | 0.7169            | 0.5116        | 0.4061
Ours                         | 0.8432            | 0.5163        | 0.5419

Table 1: BioASQ Challenge Task 7B (Phase B) results and our results. We use a dash (-) if the paper does not report results for the corresponding question type. All scores are averages of the best scores when batch results are reported in the paper. Bold denotes the best score in each column.

4.2 Experimental Results

In Table 1, we compare our results with the scores from last year’s BioASQ Challenge Task 7B (Phase B) [7, 8, 20, 26, 30, 36]. In comparison with the best results in the previous challenge, we observe that transferring from MNLI shows significant performance gains on the Yes/No (+5.59%), Factoid (+0.53%) and List (+13.58%) types.

Task    | Sequence of Transfer Learning | Accuracy | Yes F1 | No F1  | Macro F1
6B Test | BioBERT-SQuAD-BioASQ          | 0.8518   | 0.9004 | 0.6896 | 0.7950
6B Test | BioBERT-MNLI-BioASQ           | 0.8857   | 0.9212 | 0.7798 | 0.8505
7B Test | BioBERT-SQuAD-BioASQ          | 0.8595   | 0.8990 | 0.7344 | 0.8167
7B Test | BioBERT-MNLI-BioASQ           | 0.8945   | 0.9275 | 0.7588 | 0.8432

Table 2: Yes/No type question experiments. Evaluation metrics are accuracy (Accuracy), the F1 score of the yes type (Yes F1), the F1 score of the no type (No F1), and the macro F1 score over yes and no (Macro F1). Bold denotes the best score in each column for each task.

First, the Yes/No type results of our method are shown in Table 2. Using SQuAD as an intermediate fine-tuning procedure is known to enhance performance [14, 34, 36]; therefore, we evaluate our baseline with a fine-tuning sequence of BioBERT-SQuAD-BioASQ, identical to [14, 36]. The SQuAD dataset is included as an intermediate step in order to facilitate understanding of the QA task. In contrast, a sequence of BioBERT-MNLI-BioASQ significantly outperforms the baseline, with Macro F1 improvements of +5.55% and +2.65%. We think that selecting between yes and no in BioASQ is similar to deciding the entailment relationship between a hypothesis and a premise in the MNLI task. We also tried replacing the binary classifier of the BioASQ task with the trained MNLI classifier, but it showed no improvement. Thus, we fine-tune a new binary classifier to select between yes and no.

Task    | Setting  | Sequence of Transfer Learning | SAcc  | LAcc  | MRR   | Prec  | Recall | F1
6B Test | Original | BioBERT-SQuAD-BioASQ          | 39.80 | 57.82 | 47.22 | 45.02 | 47.69  | 42.34
6B Test | Original | BioBERT-MNLI-SQuAD-BioASQ     | 38.80 | 61.34 | 47.42 | 46.60 | 47.01  | 42.44
6B Test | Document | BioBERT-SQuAD-BioASQ          | 39.71 | 56.37 | 45.81 | 46.81 | 40.26  | 39.63
6B Test | Document | BioBERT-MNLI-SQuAD-BioASQ     | 39.71 | 55.10 | 45.77 | 46.26 | 39.23  | 38.13
6B Test | Snippet  | BioBERT-SQuAD-BioASQ          | 38.23 | 57.34 | 46.24 | 48.24 | 46.86  | 42.83
6B Test | Snippet  | BioBERT-MNLI-SQuAD-BioASQ     | 41.41 | 57.40 | 48.05 | 46.01 | 45.95  | 42.75
7B Test | Original | BioBERT-SQuAD-BioASQ          | 41.95 | 58.30 | 48.66 | 61.32 | 52.83  | 52.36
7B Test | Original | BioBERT-MNLI-SQuAD-BioASQ     | 42.22 | 61.06 | 49.85 | 61.46 | 54.62  | 54.19
7B Test | Document | BioBERT-SQuAD-BioASQ          | 44.46 | 57.98 | 50.02 | 58.30 | 39.19  | 43.89
7B Test | Document | BioBERT-MNLI-SQuAD-BioASQ     | 43.34 | 58.13 | 49.21 | 61.01 | 41.82  | 45.78
7B Test | Snippet  | BioBERT-SQuAD-BioASQ          | 40.79 | 58.93 | 48.27 | 60.08 | 53.96  | 53.18
7B Test | Snippet  | BioBERT-MNLI-SQuAD-BioASQ     | 45.10 | 62.45 | 51.63 | 60.92 | 53.12  | 53.01

Table 3: Experiments on the context length discrepancy. Factoid evaluation metrics (%) are strict accuracy (SAcc), lenient accuracy (LAcc) and mean reciprocal rank (MRR). List evaluation metrics (%) are precision (Prec), recall (Recall) and macro F1 score (F1). Original indicates training on full documents in SQuAD and using the snippet in BioASQ. Document refers to training on full documents in SQuAD and using the full abstract in BioASQ. Snippet denotes training on a unified distribution of minimal context. All scores are averaged over 5 batch results. Bold denotes the best score in each column for each task.

When leveraging the MNLI dataset for the factoid and list types, we have to consider the discrepancy of context lengths. The results are shown in Table 3. In the original setting, the model is trained on SQuAD with full documents, and the snippet is used when learning the BioASQ dataset. There are no performance gains on the 6B test dataset. However, we observe that performance improves as the size of the training dataset increases, as shown on the 7B test dataset.

In the document setting, we leverage the whole paragraph of the SQuAD dataset and the full abstract of the BioASQ dataset, respectively. We find that this setting shows lower performance than the original setting due to the expanded context: the search space for finding the answer is larger in the full abstract than in the human-annotated snippet. Nevertheless, on the 7B test dataset, performance improves notably for the factoid type when fine-tuned on the SQuAD dataset.

For the snippet setting, we unify the distributions of context length in the extractive QA tasks. By extracting the sentence that contains the ground truth answer span, we observe improvements on the 6B and 7B test datasets. For list type questions, further analyses are needed in future work to reduce the context length difference; for example, instead of producing a single answer in an intermediate task such as SQuAD, the model could be modified to yield multiple answers.

Batch      | System               | Yes/No (Macro F1) | Factoid (MRR) | List (F1) | Macro Avg.
8B batch 1 | Ours                 | 0.8663            | 0.4438        | 0.3718    | 0.5606
8B batch 1 | FudanLabZhu1         | 0.4518            | 0.4557        | 0.3408    | 0.4161
8B batch 1 | Umass_czi_4          | 0.5989            | 0.3005        | 0.3448    | 0.4147
8B batch 2 | Ours                 | 0.8928            | 0.3533        | 0.3798    | 0.5420
8B batch 2 | UoT_multitask_learn  | 0.7000            | 0.2800        | 0.4108    | 0.4636
8B batch 2 | FudanLabZhu4         | 0.6303            | 0.2900        | 0.4678    | 0.4627
8B batch 3 | Umass_czi_4          | 0.9016            | 0.3810        | 0.4522    | 0.5782
8B batch 3 | Ours                 | 0.9028            | 0.3601        | 0.4520    | 0.5716
8B batch 3 | pa-base              | 0.8995            | 0.3137        | 0.4585    | 0.5572
8B batch 4 | Ours                 | 0.7636            | 0.6078        | 0.4037    | 0.5917
8B batch 4 | 91-initial-Bio       | 0.7204            | 0.5735        | 0.3905    | 0.5615
8B batch 4 | Features Fusion      | 0.7097            | 0.5745        | 0.3625    | 0.5489
8B batch 5 | Ours                 | 0.8518            | 0.5677        | 0.5582    | 0.6592
8B batch 5 | Parameters retrained | 0.7509            | 0.5938        | 0.4004    | 0.5817
8B batch 5 | Features Fusion      | 0.7509            | 0.6115        | 0.3810    | 0.5811

Table 4: BioASQ 8B results for the top 3 systems in each batch. The best scores were obtained from the BioASQ leaderboard (http://participants-area.bioasq.org/results/8b/phaseB/). When several systems with similar names appear among the top scores, we keep only the highest-scoring one. We report the macro average score over all question types in BioASQ. Bold denotes our systems.

5 Analysis

5.0.1 Order of Sequential Transfer Learning

Task    | Sequence of Transfer Learning | SAcc  | LAcc  | MRR   | Prec  | Recall | F1
6B Test | BioBERT-SQuAD-BioASQ          | 39.80 | 57.82 | 47.22 | 45.02 | 47.69  | 42.34
6B Test | BioBERT-SQuAD-MNLI-BioASQ     | 41.15 | 57.95 | 47.29 | 46.18 | 44.56  | 40.98
6B Test | BioBERT-MNLI-SQuAD-BioASQ     | 38.80 | 61.34 | 47.42 | 46.60 | 47.01  | 42.44
7B Test | BioBERT-SQuAD-BioASQ          | 41.95 | 58.30 | 48.66 | 61.32 | 52.83  | 52.36
7B Test | BioBERT-SQuAD-MNLI-BioASQ     | 43.31 | 58.69 | 49.24 | 60.77 | 50.74  | 50.72
7B Test | BioBERT-MNLI-SQuAD-BioASQ     | 42.22 | 61.06 | 49.85 | 61.46 | 54.62  | 54.19

Table 5: Experiments on the order importance in sequential transfer learning. Factoid evaluation metrics (%) are strict accuracy (SAcc), lenient accuracy (LAcc) and mean reciprocal rank (MRR). List evaluation metrics (%) are precision (Prec), recall (Recall) and macro F1 score (F1). Bold denotes the best score in each column for each task.

The BioASQ Challenge Task 8B (Phase B) results are shown in Table 4. Each team can submit up to 5 systems with different combinations of strategies. The 8B ground truth answers are not yet available, so we cannot manually evaluate our suggested methods; thus, we report the scores uploaded to the leaderboard (http://participants-area.bioasq.org/results/8b/phaseB/).

In this analysis, we explore the order of sequential transfer learning; the results are shown in Table 5. For factoid type questions, we find that leveraging the MNLI dataset shows consistent improvement. For list type questions, on the other hand, performance improves when the objective functions of the fine-tuned tasks move progressively closer to the BioASQ objective function; in other words, fine-tuning on the SQuAD dataset needs to come after the MNLI dataset.

5.0.2 Limitation of the Extractive QA Setting

Type    | 7B Batch 1    | 7B Batch 2   | 7B Batch 3   | 7B Batch 4   | 7B Batch 5   | 7B Total
Factoid | 0.359 (14/39) | 0.120 (3/25) | 0.310 (9/29) | 0.118 (4/34) | 0.229 (8/35) | 0.216 (35/162)
List    | 0.083 (1/12)  | 0.235 (4/17) | 0.200 (5/25) | 0.136 (3/22) | 0.500 (6/12) | 0.204 (18/88)

Table 6: Statistics of the unanswerable rate in the extractive QA setting, i.e., the cases in which the ground truth answer cannot be exactly matched in the human-annotated corpus (snippet). The unanswerable rate is related to the upper bound of performance.

So far, the problem setup has been done under the extractive QA setting: we transform factoid and list type questions into the same format as the SQuAD dataset. We sample unanswerable examples from the BioASQ Challenge Task 7B (Phase B) test dataset. Table 6 shows the unanswerable rate over all batches of the 7B test dataset for factoid and list type questions only. We calculate this rate as the proportion of cases in which the ground truth answer cannot be exactly matched in the human-annotated corpus (snippet). The criterion subsumes cases of no exact match, a match only in lowercase, an additional phrase being added, and different types of white space between the exact answer and the snippet. As Table 7 illustrates, there is a clear upper bound on solving biomedical questions in BioASQ under the extractive QA setting. We only measured this limitation on the BioASQ Task 7B (Phase B) test dataset, but we suspect that a similar unanswerable ratio also applies to the entire training dataset. Therefore, this limitation has to be taken into account when the extractive QA setting is used in the future, and we hope our analysis provides a better way to establish the problem setup.
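A minimal sketch of how this unanswerable rate can be computed, using exact, case-sensitive substring matching against the snippets, is shown below; the data layout is illustrative.

```python
# Sketch: a question counts as unanswerable when none of its gold answers
# appears as an exact (case-sensitive) substring of any of its snippets.
def unanswerable_rate(questions):
    # questions: list of dicts with "answers" (list of str) and "snippets" (list of str)
    unanswerable = sum(
        1 for q in questions
        if not any(a in s for a in q["answers"] for s in q["snippets"])
    )
    return unanswerable / len(questions)
```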

Factoid example
  ID: 5c531d8f7e3cb0e231000017
  Question: What causes Bathing suit Ichthyosis (BSI)?
  Ground Truth Answer: transglutaminase-1 gene (TGM1) mutations
  Context: Bathing suit ichthyosis (BSI) is an uncommon phenotype classified as a minor variant of autosomal recessive congenital ichthyosis (ARCI). OBJECTIVES: We report a case of BSI in a 3-year-old Tunisian girl with a novel mutation of the transglutaminase 1 gene (TGM1)

List example
  ID: 5c5214207e3cb0e231000003
  Question: List potential reasons regarding why potentially important genes are ignored
  Ground Truth Answer: Identifiable chemical properties, Identifiable physical properties, Identifiable biological properties, Knowledge about homologous genes from model organisms
  Context: Here, we demonstrate that these differences in attention can be explained, to a large extent, exclusively from a small set of identifiable chemical, physical, and biological properties of genes. Together with knowledge about homologous genes from model organisms, these features allow us to accurately predict the number of publications on individual human genes, the year of their first report, the levels of funding awarded by the National Institutes of Health (NIH), and the development of drugs against disease-associated genes.

Table 7: Limitation of the extractive QA setting in the BioASQ dataset. We sample examples of factoid and list type questions from the 7B test dataset. Context refers to the snippet, which is the human annotation provided by the organizers. Bold and underline denote no exact match and an exact match only in lowercase, respectively.

6 Conclusion

In our work, we use natural language inference (NLI) as the first fine-tuning step for biomedical question answering (QA). Learning the linguistic knowledge of entailment in sentence pairs enhances performance in biomedical QA, and we empirically demonstrate that leveraging NLI improves performance in the BioASQ Challenge. In this process, the order of tasks in sequential transfer learning needs to be considered during training. Furthermore, we unify the distributions of context length to mitigate the discrepancy between NLI and biomedical QA. Finally, when converting the BioASQ dataset into the SQuAD format, we analyze an intrinsic limitation of the human annotation, namely that an answer does not always exactly match the context.

References

  • [1] E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly available clinical bert embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Cited by: §2.0.1.
  • [2] I. Beltagy, K. Lo, and A. Cohan SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, Cited by: §2.0.1.
  • [3] S. Bowman, G. Angeli, C. Potts, and C. D. Manning A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on EMNLP, Cited by: §1, §2.0.2.
  • [4] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu (2020) Recall and learn: fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651. Cited by: §1, §2.0.1.
  • [5] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the NAACL: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.0.2, §3.3.1.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the NAACL: Human Language Technologies, Cited by: §1, §2.0.1.
  • [7] D. Dimitriadis and G. Tsoumakas (2019) Yes/no question answering in bioasq 2019. In ECML PKDD, Cited by: §4.2, Table 1.
  • [8] S. Hosein et al. (2019) Measuring domain portability and errorpropagation in biomedical qa. arXiv preprint arXiv:1909.09704. Cited by: §1, §4.2, Table 1.
  • [9] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the ACL (Volume 1: Long Papers), Cited by: §2.0.1.
  • [10] Q. Jin, B. Dhingra, W. Cohen, and X. Lu (2019) Probing biomedical embeddings from language models. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, Cited by: §2.0.1.
  • [11] D. Kim, J. Lee, C. H. So, H. Jeon, M. Jeong, Y. Choi, W. Yoon, M. Sung, and J. Kang (2019) A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access. Cited by: §2.0.1.
  • [12] N. Kim, R. Patel, A. Poliak, P. Xia, A. Wang, T. McCoy, I. Tenney, A. Ross, T. Linzen, B. Van Durme, et al. (2019) Probing what different nlp tasks teach machines about function word comprehension. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (* SEM 2019), Cited by: §2.0.1.
  • [13] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1, §2.0.1.
  • [14] J. Lee, W. Yoon, S. Kim, D. Kim, C. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining.. Bioinformatics (Oxford, England). Cited by: §1, §2.0.1, §2.0.2, §4.2.
  • [15] H. Levesque et al. (2012) The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: §2.0.2.
  • [16] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the NAACL: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.0.1.
  • [17] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791. Cited by: §2.0.1.
  • [18] S. Min, V. Zhong, R. Socher, and C. Xiong (2018) Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the ACL (Volume 1: Long Papers), Cited by: §1, §3.3.2.
  • [19] L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin How transferable are neural networks in nlp applications?. In Proceedings of the 2016 Conference on EMNLP, Cited by: §2.0.1.
  • [20] M. Oita et al. (2019) Semantically corroborating neural attention for biomedical question answering. In ECML PKDD, Cited by: §4.2, Table 1.
  • [21] Y. Peng, S. Yan, and Z. Lu (2019) Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, Cited by: §2.0.1, §2.0.2.
  • [22] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the NAACL: Human Language Technologies, Volume 1 (Long Papers), Cited by: §1, §2.0.1.
  • [23] J. Phang et al. (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088. Cited by: §1, §2.0.1, §2.0.2.
  • [24] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1.
  • [25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on EMNLP, Cited by: §1, §2.0.2, §4.1.
  • [26] M. Resta, D. Arioli, A. Fagnani, and G. Attardi (2019) Transformer models for question answering at bioasq 2019. In ECML PKDD, Cited by: §4.2, Table 1.
  • [27] A. Romanov and C. Shivade Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on EMNLP, Cited by: §2.0.2.
  • [28] S. Ruder (2019) Neural transfer learning for natural language processing. Ph.D. Thesis. Cited by: §1, §2.0.1.
  • [29] A. Talmor and J. Berant (2019) MultiQA: an empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the ACL, Cited by: §1.
  • [30] S. K. Telukuntla et al. (2019) UNCC biomedical semantic question answering systems. bioasq: task-7b, phase-b. In ECML PKDD, Cited by: §4.2, Table 1.
  • [31] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al. (2015) An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics. Cited by: §1.
  • [32] T. Vu, T. Wang, T. Munkhdalai, A. Sordoni, A. Trischler, A. Mattarella-Micke, S. Maji, and M. Iyyer Exploring and predicting transferability across nlp tasks. Cited by: §1, §2.0.2.
  • [33] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §1, §2.0.2.
  • [34] G. Wiese et al. (2017) Neural domain adaptation for biomedical question answering. In Proceedings of the 21st Conference on CoNLL, Cited by: §1, §2.0.2, §4.2.
  • [35] A. Williams et al. (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the NAACL: Human Language Technologies, Volume 1 (Long Papers), Cited by: §1, §2.0.2, §4.1.
  • [36] W. Yoon, J. Lee, D. Kim, M. Jeong, and J. Kang (2019) Pre-trained language model for biomedical question answering. arXiv preprint arXiv:1909.08229. Cited by: §1, §2.0.1, §2.0.2, §3.3.2, §4.1, §4.2, §4.2, Table 1.
  • [37] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in NIPS, Cited by: §2.0.1.

Appendix 0.A Training Details

MNLI       | Train   | Dev
Original   | 392,702 | 9,815

SQuAD v1.1 | Train   | Dev
Original   | 87,412  | 10,570
Snippet    | 82,280  | 9,986

SQuAD v2.0 | Train   | Dev
Original   | 130,319 | 11,873

BioASQ  | Strategy         | 6B Train | 6B Test | 7B Train | 7B Test | 8B Train | 8B Test
Yes/No  | Snippet-as-is    | 9,421    | 127     | 10,560   | 140     | 11,531   | 152
Factoid | Full-Abstract    | 7,911    |         | 9,403    |         | 10,147   |
Factoid | Appended-Snippet | 5,953    | 161     | 7,179    | 162     | 7,896    | 151
Factoid | Snippet-as-is    | 3,512    |         | 4,231    |         | 4,759    |
List    | Full-Abstract    | 14,008   |         | 15,719   |         | 16,879   |
List    | Appended-Snippet | 10,878   | 81      | 12,184   | 88      | 13,251   | 75
List    | Snippet-as-is    | 6,922    |         | 7,865    |         | 8,676    |

Table 8: Statistics of the transferred datasets (MNLI & SQuAD) and the target dataset (BioASQ). Test sets are shared across pre-processing strategies within each question type, so each test count is listed once per type.

We use BioBERT to learn biomedical entity representations. We utilize a single NVIDIA Titan RTX (24GB) GPU to fine-tune each sequence of transfer learning. For the MNLI task, we use the hyperparameters suggested by Hugging Face (https://github.com/huggingface/transformers/tree/master/examples/text-classification). For fine-tuning, we select a batch size of 12 or 24 and a learning rate in the range 1e-6 to 9e-6. In post-processing, we use the abbreviation resolution module Ab3P (https://github.com/ncbi-nlp/Ab3P) to remove duplicate answers that appear in different forms.
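As an illustration only, the fine-tuning hyperparameters above could be expressed with the Hugging Face Trainer API as follows; the output path and the epoch count are assumptions, not values reported in this paper.

```python
# Sketch of the fine-tuning configuration (batch size 12 or 24, learning rate
# swept between 1e-6 and 9e-6) using Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bioasq-finetune",        # illustrative output path
    per_device_train_batch_size=12,      # 12 or 24 in our experiments
    learning_rate=3e-6,                  # swept within 1e-6 .. 9e-6
    num_train_epochs=3,                  # assumption; not stated in the text
)
```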