Results of the seventh edition of the BioASQ Challenge

06/16/2020
by   Anastasios Nentidis, et al.

The results of the seventh edition of the BioASQ challenge are presented in this paper. The aim of the BioASQ challenge is the promotion of systems and methodologies through the organization of a challenge on the tasks of large-scale biomedical semantic indexing and question answering. In total, 30 teams with more than 100 systems participated in the challenge this year. As in previous years, the best systems were able to outperform the strong baselines. This suggests that state-of-the-art systems are continuously improving, pushing the frontier of research.


1 Introduction

The aim of this paper is twofold: first, to give an overview of the data issued during the BioASQ challenge in 2019, and second, to present the systems that participated in the challenge and evaluate their performance. To achieve these goals, we begin with a brief overview of the tasks, which took place from February to May 2019, and of the challenge's data. Thereafter, we provide an overview of the systems that participated in the challenge. Detailed descriptions of some of the systems are given in the workshop proceedings. The evaluation of the systems, which was carried out using state-of-the-art measures or manual assessment, is the last focal point of this paper, with remarks regarding the results of each task. The conclusions sum up this year's challenge.

2 Overview of the Tasks

The challenge comprised two tasks: (1) a large-scale biomedical semantic indexing task (Task 7a) and (2) a biomedical question answering task (Task 7b). In this section, a brief description of the tasks is provided, focusing on differences from previous years and on updated statistics about the corresponding datasets. A complete overview of the tasks and the challenge is presented in [Tsatsaronis2015].

2.1 Large-scale semantic indexing - 7a

In Task 7a the goal is to classify documents from the PubMed digital library into concepts of the MeSH hierarchy. Here, new PubMed articles that are not yet annotated by MEDLINE indexers are collected and used as test sets for the evaluation of the participating systems. Similarly to tasks 5a and 6a, articles from all journals were included in the test data sets of task 7a. As soon as the annotations become available from the MEDLINE indexers, the performance of each system is calculated using standard flat information retrieval measures, as well as hierarchical ones. As in previous years, an on-line and large-scale scenario was provided, dividing the task into three independent batches of 5 weekly test sets each. Participants had 21 hours to provide their answers for each test set. Table 1 shows the number of articles in each test set of each batch of the challenge. In total, 14,200,259 articles, with 12.69 labels per article on average, were provided as training data to the participants.

Batch Articles Annotated Articles Labels per Article
1 7,358 7,194 11.67
1 7,166 7,021 12.95
1 11,019 10,831 13.04
1 5,566 5,482 12.32
1 6,729 6,353 12.96
Total 37,838 36,881 12.31
2 6,380 6,098 12.51
2 6,785 6,621 12.75
2 6,207 5,927 12.75
2 7,382 7,079 13.00
2 7,240 6,756 12.65
Total 33,994 32,481 12.27
3 6,266 5,835 12.58
3 11,455 10,386 12.86
3 4,750 3,947 12.67
3 7,338 5,021 12.70
3 6,920 4,554 12.63
Total 36,729 29,743 12.14
Table 1: Statistics on test datasets for Task 7a.

2.2 Biomedical semantic QA - 7b

The goal of Task 7b was to provide a large-scale question answering challenge where the systems had to cope with all stages of a question answering task for four types of biomedical questions: “yes/no”, “factoid”, “list” and “summary” questions [balikas13]. As in previous years, the task comprised two phases. In phase A, BioASQ released 100 questions and participants were asked to respond with relevant elements from specific resources, including relevant MEDLINE articles, relevant snippets extracted from the articles, relevant concepts and relevant RDF triples. In phase B, the released questions were enhanced with relevant articles and snippets selected manually, and the participants had to respond with exact answers, as well as with summaries in natural language (dubbed ideal answers). The task was split into five independent test batches, and the two phases for each batch were run with a time gap of 24 hours. In each phase, the participants received 100 questions and had 24 hours to submit their answers. Table 2 presents the statistics of the training and test data provided to the participants.

Batch Size Documents Snippets
Train 2,747 11.14 13.91
Test 1 100 3.07 3.93
Test 2 100 2.64 3.22
Test 3 100 3.08 4.05
Test 4 100 2.78 3.71
Test 5 100 2.39 2.62
Total 3,247 9.85 12.31
Table 2: Statistics on the training and test datasets of Task 7b. All the numbers for the documents and snippets refer to averages.

3 Overview of Participants

3.1 Task 7a

For this task, 12 teams participated and results from 30 different systems were submitted. In the following paragraphs we describe those systems for which a description was available, stressing their key characteristics. An overview of the systems and their approaches can be seen in Table 3.

System Approach
ceb CNN, embeddings, ensembles
DeepMesh d2v, tf-idf, MESHlabeler, attention scheme, PLT
Iria bigrams, Lucene Index, k-NN, ensembles, UIMA ConceptMapper
MeSHProbeNet-P Bidirectional RNN (GRU), attention scheme, encoder-decoder architecture
Semantic NoSQL KE UIMA ConceptMapper, par2vec, DeepLearning4j (https://deeplearning4j.org/, accessed June 2019)
Table 3: Systems and approaches for Task 7a. Systems for which no description was available at the time of writing are omitted.

The National Library of Medicine (NLM) team, in its “ceb” systems [Rae2019], adopts an end-to-end deep learning architecture with Convolutional Neural Networks (CNN) [liu2017deep] to improve the results of the Medical Text Indexer (MTI) [morkBioasq2014]. In particular, they combine text embeddings with journal information. They also consider information about the years of publication and indexing, to capture concept drift and variations in the MeSH vocabulary, respectively. Finally, they experiment with an ensemble of independently trained DL models.
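
As a rough illustration of this kind of architecture (not the NLM team's actual model), the sketch below combines convolutional features over the title/abstract tokens with a learned journal embedding before a multi-label output layer; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class CnnMeshClassifier(nn.Module):
    """Toy CNN indexer: text convolutions plus a journal embedding (sizes are assumptions)."""
    def __init__(self, vocab_size, n_journals, n_labels, emb_dim=200, n_filters=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.journal_emb = nn.Embedding(n_journals, 50)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (2, 3, 5)]
        )
        self.out = nn.Linear(3 * n_filters + 50, n_labels)

    def forward(self, tokens, journal_id):
        x = self.word_emb(tokens).transpose(1, 2)            # (batch, emb_dim, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        j = self.journal_emb(journal_id)                     # (batch, 50)
        logits = self.out(torch.cat(feats + [j], dim=1))     # one logit per MeSH heading
        return logits                                        # train with BCEWithLogitsLoss
```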

The Fudan University team builds upon their previous “DeepMeSH” systems, which are based on document-to-vector (d2v) and tf-idf feature embeddings [peng2016], the MESHLabeler system [liu2015] and learning to rank (LTR). This year, they incorporate AttentionXML [You2018], a deep-learning-based extreme multi-label text classification model, into the “DeepMeSH” framework. In particular, AttentionXML combines a multi-label attention mechanism, to capture label-specific information, with a shallow and wide probabilistic label tree (PLT) [Jain2016], for improved efficiency.
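
A minimal sketch of the label-wise attention idea at the core of AttentionXML (the PLT component is omitted, and all dimensions and names are assumptions, not the DeepMeSH implementation):

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Each label gets its own attention distribution over token states (PLT omitted)."""
    def __init__(self, hidden_dim, n_labels):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(n_labels, hidden_dim))
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_states):                  # (batch, seq_len, hidden_dim)
        # one attention distribution over tokens per label
        attn = torch.einsum("bsh,lh->bls", token_states, self.label_queries)
        attn = torch.softmax(attn, dim=-1)            # (batch, n_labels, seq_len)
        label_ctx = torch.einsum("bls,bsh->blh", attn, token_states)
        return self.score(label_ctx).squeeze(-1)      # (batch, n_labels) logits
```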

The “Iria” systems [ribadascole] are based on the same techniques used by their systems for the previous version of the challenge, which are summarized in Table 3 and described in the corresponding challenge overview [nentidis2017results].

The “MeSHProbeNet-P” systems are upgraded versions of MeSHProbeNet [Xun2019], which participated in BioASQ6 under the name “xgx”. Their approach is based on an end-to-end deep learning model with an encoder-decoder architecture. The encoder consists of a recurrent neural network with multiple attentive MeSH probes to extract different aspects of biomedical knowledge from each input article. In “MeSHProbeNet-P” the attentive MeSH probes are also personalized for each biomedical article, based on the domain of the article as expressed by the journal in which it was published.

Finally, the “Semantic NoSQL KE” system variants [Bernd2019] were developed by extending the previous year’s “SNOKE” systems. The systems are based on the ZB MED Knowledge Environment [ZBMed2017], utilizing the Snowball Stemmer [snowball:2000] and the UIMA [tanenblatt2010] ConceptMapper to find matches between MeSH terms and words in the title and abstract of each target document, adopting different matching strategies. Paragraph Vectors [Le2014] trained on the BioASQ corpus are used to rank and filter all the MeSH headings suggested by the UIMA-based framework for each document.

Similarly to the previous year, two systems developed by NLM to assist the indexers in the annotation of MEDLINE articles served as baselines for the semantic indexing task of the challenge: MTI [morkBioasq2014], with some enhancements introduced in [zavorin2016], and an extension of it that incorporates features of the winning system of the first BioASQ challenge [tsoumakasBioasq].

3.2 Task 7b

The question answering task was tackled by 73 different systems, developed by 18 teams. In the first phase, which concerns the retrieval of information required to answer a question, 6 teams with 23 systems participated. In the second phase, where teams were requested to submit exact and ideal answers, 13 teams with 52 different systems participated. An overview of the technologies employed by each team can be seen in Table 4.

Systems Phase Approach
AUTH A, B MetaMap, BeCAS, Lucene Index, ElasticSearch, Wordnet, ELMo, SentiWordnet, w2vec, BiLSTM
AUEB A BM25, w2vec, BERT, DL (BCNN, PACRR, PDRMM)
MindLab A ElasticSearch, BM25, QuickUMLS, w2vec, WMD, DL (CNN)
_sys A Word and Sentence embeddings, Pseudo Relevance Feedback, BM25, LSI
BJUTNLP B SQUAD, GloVe, BiLSTM, Pointer Network
BIOASQ_VK B ELMo, DMN attention mechanisms, NLTK-VADER
DMIS B BioBERT, SQUAD, transfer learning

google B BERT, CoQA, Natural Questions
L2PS B SQUAD, Quasar-T, DRQA (RNN, LSTM), PSPR (LSTM), BioBERT
LabZhu B PubTator, Stanford POS tool, SPARQL
MQU B w2vec, tf-idf, DL (LSTM), Reinforcement Learning

UNCC B BioBERT, SQUAD, Stanford POS tool, AllenNLP entailment
unipi-quokka-QA B ELMo, ELMo-PubMed, BERT, BioBERT, SciSpacy
Table 4: Systems and approaches for Task 7b. Systems for which no information was available at the time of writing are omitted.

The “AUTH” team participated in both phases of Task 7B, with a focus on phase B. For the document retrieval task, they experimented with approaches based on the BioASQ search services and ElasticSearch, querying with the conjunction of the words in each question for the top 10 documents. In phase B, for factoid and list questions they used updated versions of their BioASQ6 system [Dimitriadis2019], based on word embeddings, MetaMap [AronsonL10], BeCAS [Nunes2013] and WordNet. For yes/no questions they experimented with different deep learning methods, based on ELMo embeddings [Peters2018], SentiWordnet [Esuli06sentiwordnet:a] and similarity matrices that represent the question/answer pairs and are used as input to different BiLSTM architectures [Dimitriadis2019bioasq].

The “AUEB” team participated in Phase A, on the document and snippet retrieval tasks, achieving strong results. They build upon their BioASQ6 document retrieval systems [Brokos2018, mcdonald2018], which they modify to also yield a relevance score for each sentence, and experiment with BERT and PACRR [mcdonald2018] for this task. For snippet retrieval, they utilize a BCNN [yin2016abcnn] model and a model based on POSIT-DRMM (PDRMM) [mcdonald2018]. They also introduce JPDRMM, a novel deep learning approach for joint document and snippet ranking based on PDRMM [Pappas2019].

Another approach based on deep learning methodologies for Phase A, again focusing on document and snippet retrieval, was proposed by the “MindLaB” team from the National University of Colombia [Vargas2019]. For document retrieval they use the BM25 model [robertson:1976] and ElasticSearch [gormley2015elasticsearch] for efficiency, along with a Word Mover’s Distance [kusner2015word] based re-ranking scheme. For snippet retrieval, as in the previous approach, they utilized a very large collection of PubMed articles to train a CNN with similarity matrices of question-answer pairs. More specifically, they employ the BioNLPLab w2vec embeddings (http://bio.nlplab.org, accessed June 2019), which take into account the part of speech of each word. They also deploy the QuickUMLS [soldaini2016quickumls] tool to create a cui2vec embedding for each snippet.
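
A simplified sketch of this retrieve-then-rerank pattern, assuming the rank_bm25 and gensim packages; the toy corpus, query and word2vec model path are placeholders, not the team's actual setup.

```python
from rank_bm25 import BM25Okapi
from gensim.models import KeyedVectors

docs = ["tamoxifen is used in breast cancer treatment".split(),
        "statins lower cholesterol levels".split()]            # toy tokenised corpus
query = "drugs for breast cancer".split()

bm25 = BM25Okapi(docs)
scores = bm25.get_scores(query)
top = sorted(range(len(docs)), key=lambda i: -scores[i])[:10]  # initial BM25 ranking

# re-rank the BM25 candidates by Word Mover's Distance (smaller = more similar)
w2v = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)  # placeholder path
reranked = sorted(top, key=lambda i: w2v.wmdistance(query, docs[i]))
print(reranked)
```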

The “_sys” systems also participated in Phase A of Task 7B. These systems filter the queries, using stop-word lists and regular expressions, and expand them using word embeddings and pseudo-relevance feedback. Relevant documents are retrieved, utilizing Query Likelihood with bigrams and BM25, and reranked, based on Latent Semantic Indexing (LSI) and document vectors. In particular, document vectors based on averaging sentence embeddings are adopted. Finally, different lists of documents are merged to form the final result, considering the position of the documents in each list.

In phase B, most systems focused on using embeddings and deep learning methodologies to tackle the tasks. For example, the “BJUTNLP” system utilizes the SQUAD dataset for pre-training. The system uses both GloVe embeddings [pennington2014glove] (fine-tuned during training) and character-level word embeddings (through a 1-dimensional CNN) as input to a BiLSTM model, and for each question a Pointer Network [see2017get] is finally responsible for pinpointing the exact start and end positions of the answer in the relevant snippets.

The “BIOASQ_VK” systems were based on BioBERT [lee2019biobert], but with novel modifications to allow the model to cope with yes/no, factoid and list questions [Kanjirangat2019]. They pre-trained the model on the SQUAD dataset (for factoid and list questions) and SQUAD2 (for yes/no questions) to compensate for the small size of the BioASQ dataset, and by exploiting different pre-/post-processing techniques they obtained strong results on all subtasks.

The “DMIS” systems focused on the importance of the information (words, phrases and sentences) for a given question [Yoon2019]. To this end, sentence-level embeddings based on ELMo embeddings [Peters2018] and attention mechanisms facilitated by Dynamic Memory Networks (DMN) [kumar2016ask] are deployed. Moreover, sentiment analysis is performed on yes/no questions to guide the classification (positive corresponds to yes) using the NLTK-VADER [hutto2014vader] tool.

The “google” systems [Hosein2019] focus on factoid questions and are based on BERT models [devlin2018bert], specifically the model of [alberti2019bert] trained on the Natural Questions dataset [kwiatkowski2019natural], while also utilizing the CoQA [reddy2019coqa] and BioASQ datasets. They experiment with different inputs to the models, including the abstracts of relevant articles, the provided gold snippets and predicted relevant snippets. In particular, they focus on error propagation in end-to-end information retrieval and question answering systems, reaching the interesting conclusion that the information retrieval part is a bottleneck for such end-to-end QA systems.
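
For reference, a minimal extractive-QA sketch in the spirit of these BERT-based systems, using the Hugging Face transformers question-answering pipeline with its default checkpoint; substituting a BioBERT model fine-tuned on SQuAD would be closer to the systems described above, and the snippet and question below are illustrative only.

```python
from transformers import pipeline

# any extractive-QA checkpoint works here; the library default is used for brevity
qa = pipeline("question-answering")

snippet = ("Tamoxifen is a selective estrogen receptor modulator used in the "
           "treatment of estrogen-receptor-positive breast cancer.")
result = qa(question="Which receptor does tamoxifen modulate?", context=snippet)
print(result["answer"], result["score"])   # predicted answer span and its confidence
```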

Interesting results come from the “L2PS” team, who quantify the importance of pre-training and fine-tuning models for question answering and view the task under different regimes, namely Reading Comprehension (RC) and Open QA [Kamath2019]. For the RC regime they use DRQA’s document reader [chen2017reading], while for Open QA they utilize the PSPR model [lin2018denoising]. They experiment with different datasets for fine-tuning the models (SQUAD [rajpurkar2016squad] for RC and Quasar-T [dhingra2017quasar] for Open QA), as well as with BioBERT [lee2019biobert] embeddings, to gain insights on the effect of the context length in this task.

The “LabZhu” systems [zhang2015] improved upon their systems from BioASQ6, with a focus on exact answer generation. In particular, for factoid and list questions they developed two distinct approaches: one based on traditional information retrieval, involving candidate answer generation and ranking, and one based on a knowledge graph. In the latter approach, the answer type and the topic entity of the question are predicted, and a SPARQL query is generated from them and used to retrieve results from the knowledge graph. Finally, the results of the two approaches are combined to form the final answer to the question.

The Macquarie University (“MQU”) team focused on ideal answers, treating snippet relevance as a classification task [Molla2019]. Extending their previous work [Diego2017, molla2018macquarie], the snippets are marked as summary-relevant or not, utilizing w2vec embeddings and tf-idf vectors of the question-sentence pairs, showcasing that a classification scheme is more appropriate than a regression one. Also, based on their previous work [molla_REINFORCE:2017], they conduct experiments using reinforcement learning towards the ROUGE score of the ideal answers, and a correlation analysis between various ROUGE metrics and the BioASQ human evaluation scores, observing poor correlation of the ROUGE-Recall score with human evaluation.
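
A toy version of this classification view of snippet selection, using tf-idf features of the question and sentence and a logistic regression relevance classifier in scikit-learn; the pair representation and the tiny training data are illustrative assumptions, not the MQU system.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pairs = [("what causes scurvy", "scurvy is caused by vitamin C deficiency", 1),
         ("what causes scurvy", "the trial enrolled 200 patients", 0),
         ("is aspirin an nsaid", "aspirin is a nonsteroidal anti-inflammatory drug", 1),
         ("is aspirin an nsaid", "the study was conducted in 2015", 0)]

vec = TfidfVectorizer().fit([q + " " + s for q, s, _ in pairs])
# simple pair representation: concatenate the tf-idf vectors of question and sentence
X = np.hstack([vec.transform([q for q, _, _ in pairs]).toarray(),
               vec.transform([s for _, s, _ in pairs]).toarray()])
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))   # 1 = sentence considered summary-relevant
```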

The “UNCC” team focused on factoid, list and yes/no questions [Telukuntla2019]. Their work is based on BioBERT [lee2019biobert] embeddings fine-tuned on previous years of BioASQ. They also utilize the SQUAD dataset for factoid answers and incorporate the Lexical Answer Type (LAT) [ferrucci2010building] and POS tags, along with hand-made rules, to address specific errors of the system. Furthermore, they incorporate the entailment of the candidate sentences in yes/no questions using the AllenNLP library [Gardner2017AllenNLP].

Finally, the “unipi-quokka-QA” system tackled all the different question types in phase B [Resta2019]. Their work focused on experimenting with different Transformer models and embeddings, namely ELMo, ELMo-PubMed, BERT and BioBERT. They used different strategies depending on the question type, such as ensembles for yes/no questions, biomedical named entity extraction (using SciSpacy [Neumann2019ScispaCyFA]) for list questions, and different pre-/post-processing procedures.

In this challenge too, the open-source OAQA system proposed in [yang2016learning] served as the baseline for phase B. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed on top of the UIMA framework. ClearNLP is employed for question and snippet parsing. MetaMap, TmTool [Wei2016], C-Value and LingPipe [baldwin2003lingpipe] are used for concept identification, and UMLS Terminology Services (UTS) for concept retrieval. The final steps include identification of concept, document and snippet relevance, based on classifier components, and scoring, ranking and reranking techniques.

4 Results

4.1 Task 7a

Each of the three batches of Task 7a was evaluated independently. The classification performance of the systems was measured using flat and hierarchical evaluation measures [balikas13]. The micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to choose the winners for each batch [kosmopoulos2015evaluation].
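
As a concrete reference for the flat measure, the micro F-measure over multi-label MeSH predictions can be computed with scikit-learn as follows (the labels below are toy data):

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

gold = [["Humans", "Neoplasms"], ["Humans", "Aspirin"]]
pred = [["Humans", "Neoplasms", "Mice"], ["Humans"]]

mlb = MultiLabelBinarizer().fit(gold + pred)
# micro F-measure aggregates true/false positives and negatives over all labels
mif = f1_score(mlb.transform(gold), mlb.transform(pred), average="micro")
print(round(mif, 3))
```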

According to [Demsar06], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset, the system with the best performance gets rank 1.0, the second best rank 2.0, and so on. In case two or more systems tie, they all receive the average rank. Table 5 presents the average rank (according to MiF and LCA-F) of each system over all the test sets of the corresponding batches. Note that the average ranks are calculated over the 4 best results of each system in each batch, according to the rules of the challenge.
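
The tie-aware average-rank procedure can be reproduced with scipy's rankdata, ranking systems by descending MiF on each test set and averaging per system (the scores below are made up):

```python
import numpy as np
from scipy.stats import rankdata

# rows = test sets, columns = systems; values are MiF scores (toy numbers)
mif = np.array([[0.66, 0.64, 0.64, 0.60],
                [0.65, 0.66, 0.63, 0.61],
                [0.67, 0.65, 0.65, 0.62]])

# rankdata ranks ascending and averages ties, so rank the negated scores
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, mif)
print(ranks.mean(axis=0))   # average rank of each system across the test sets
```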

System Batch 1 Batch 2 Batch 3
MiF LCA-F MiF LCA-F MiF LCA-F
DeepMeSH5 - - 1.00 1.00 1.00 1.00
DeepMeSH4 - - 9.50 9.50 2.25 1.75
DeepMeSH3 8.25 8.50 3.50 5.00 2.50 2.75
DeepMeSH1 5.00 6.25 2.00 2.63 3.75 4.13
DeepMeSH2 7.25 7.25 3.50 4.50 4.75 4.38
MeSHProbeNet-P2 2.63 2.63 4.63 5.88 6.50 8.25
MeSHProbeNet-P1 3.25 2.13 6.38 4.25 6.88 6.50
MeSHProbeNet-P3 5.00 4.63 8.38 7.25 7.50 7.38
MeSHProbeNet-P 2.38 3.25 7.00 4.38 8.13 7.75
MeSHProbeNet-P0 1.50 1.25 6.25 5.63 8.75 7.88
ceb 1 ensemble - - - - 11.00 11.00
Default MTI 9.75 8.75 12.00 11.75 12.25 12.25
ceb1 8.75 9.25 11.00 11.25 12.25 13.50
MTI First Line Index 11.50 11.25 13.00 12.50 13.25 12.00
iria-mix - - 14.00 14.00 14.50 14.75
Semantic NoSQL KE 2 - - - - 16.00 16.00
Semantic NoSQL KE 1 - - - - 17.00 17.75
Table 5: Average system ranks across the batches of Task 7a. A hyphen (-) is used whenever the system participated in fewer than 4 test sets in the batch. Systems with fewer than 4 participations in all batches are omitted.

The results in Task 7a show that in all test batches and for both flat and hierarchical measures, some systems outperform the strong baselines. In particular, the “MeSHProbeNet-P” systems achieve the best performance in the first batch, while the “DeepMeSH” systems outperform them in the last two batches. More detailed results can be found in the online results page (http://participants-area.bioasq.org/results/7a/). Comparison of these results with the corresponding system results from previous years reveals the improvement of both the baseline and the top-performing systems through the years of the competition, as shown in Figure 1.

Figure 1: The micro F-measure achieved by systems across different years of the BioASQ challenge. For each test set, the micro F-measure is presented for the best performing system (Top) and the MTI baseline, as well as the average micro F-measure of all the participating systems (Avg).

4.2 Task 7b

System Mean Precision Mean Recall Mean F-measure MAP GMAP
aueb-nlp-2 0.2060 0.4039 0.2365 0.2114 0.0075
aueb-nlp-1 0.2124 0.4083 0.2440 0.2086 0.0065
aueb-nlp-5 0.2157 0.4235 0.2467 0.1821 0.0098
MindLab QA Reloaded 0.1587 0.2760 0.1723 0.1527 0.0013
Deep ML methods for 0.1331 0.2692 0.1589 0.1234 0.0009
MindLab Red Lions++ 0.1371 0.2538 0.1535 0.1187 0.0014
aueb-nlp-3 0.1488 0.3427 0.1779 0.1149 0.0053
MindLab QA System ++ 0.1288 0.2049 0.1364 0.1136 0.0010
aueb-nlp-4 0.1520 0.3237 0.1791 0.1116 0.0056
MindLab QA System 0.1297 0.2536 0.1478 0.1094 0.0016
lh_sys1 0.0399 0.0810 0.0478 0.0178 0.0001
lh_sys3 0.0233 0.0437 0.0266 0.0151 0.0001
lh_sys5 0.0233 0.0437 0.0266 0.0151 0.0001
lh_sys4 0.0233 0.0437 0.0266 0.0148 0.0001
lh_sys2 0.0182 0.0281 0.0193 0.0051 0.0001
Table 6: Results for snippet retrieval in batch 4 of phase A of Task 7b.
System Mean Precision Mean Recall Mean F-measure MAP GMAP
aueb-nlp-4 0.1750 0.6266 0.2471 0.1199 0.0151
aueb-nlp-2 0.1740 0.6139 0.2449 0.1121 0.0156
aueb-nlp-5 0.3599 0.6128 0.4034 0.1102 0.0164
aueb-nlp-1 0.1700 0.5912 0.2380 0.1041 0.0118
auth-qa-1 0.2675 0.3896 0.2894 0.1033 0.0018
aueb-nlp-3 0.1600 0.5806 0.2266 0.0986 0.0104
lh_sys4 0.1420 0.5490 0.2081 0.0920 0.0069
Ir_sys1 0.1410 0.5365 0.2059 0.0907 0.0059
lh_sys1 0.1420 0.5449 0.2076 0.0881 0.0063
MindLab QA Reloaded 0.1330 0.5288 0.1950 0.0863 0.0062
Table 7: Results for document retrieval in batch 3 of phase A of Task 7b. Only the top-10 systems are presented.
System Yes/No Factoid List
Acc. F1 Str. Acc. Len. Acc. MRR Prec. Rec. F1
BioBERT-DMIS-3 0.8286 0.8250 0.2857 0.4286 0.3452 0.5653 0.4131 0.4619
BioBERT-DMIS 0.8000 0.7822 0.2571 0.4571 0.3224 0.5236 0.3714 0.4202
unipi-quokka-QA-5 0.8000 0.7939 0.0857 0.1714 0.1152 0.1713 0.5873 0.2537
BioBERT-DMIS-2 0.7429 0.7200 0.2571 0.4571 0.3271 0.5486 0.3992 0.4468
BioBERT-DMIS-4 0.7429 0.7351 0.2286 0.4571 0.3238 0.5069 0.3575 0.4051
google-gold-input-ab 0.7143 0.6941 0.2286 0.2857 0.2571 0.1774 0.4175 0.2415
unipi-quokka-QA-4 0.7143 0.6941 0.0857 0.1714 0.1152 0.1713 0.5873 0.2537
unipi-quokka-QA-3 0.6857 0.6578 0.0857 0.1714 0.1152 0.1713 0.5873 0.2537
google-gold-input 0.6571 0.6023 0.2857 0.3714 0.3167 0.2159 0.4452 0.2824
DMIS 0.6571 0.6023 0.2857 0.5143 0.3638 0.5050 0.3714 0.4124
BioASQ_Baseline 0.4857 0.4643 0.0571 0.1429 0.0867 0.2127 0.3619 0.2573
Table 8: Results for batch 5 for exact answers in phase B of Task 7b. Only the top-10 systems are presented along with the BioASQ baseline.

Phase A: For phase A and for each of the four types of annotations (documents, concepts, snippets and RDF triples), we rank the systems according to the Mean Average Precision (MAP) measure. The final ranking for each batch is calculated as the average of the individual rankings in the different categories. In Tables 6 and 7 some indicative results from batches 3 and 4 are presented. Full results are available on the online results page of Task 7b, phase A (http://participants-area.bioasq.org/results/7b/phaseA/). These results are preliminary. The final results for Task 7b, phase A will be available after the manual assessment of the system responses.
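
For reference, a minimal Mean Average Precision computation over ranked document lists, using a standard AP definition on toy data; the official BioASQ evaluation may differ in details such as the rank cut-off.

```python
def average_precision(ranked, relevant):
    """AP of one ranked list: mean precision at the positions of relevant items."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

runs = [(["d1", "d3", "d2"], {"d1", "d2"}),   # question 1: system ranking, gold documents
        (["d5", "d4"], {"d4"})]               # question 2
map_score = sum(average_precision(r, g) for r, g in runs) / len(runs)
print(round(map_score, 3))
```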

Phase B: In phase B of Task 7b the systems were asked to produce exact and ideal answers. For ideal answers, the systems will eventually be ranked according to manual evaluation by the BioASQ experts [balikas13]. Regarding exact answers (no exact answers are required for summary questions), the systems were ranked according to accuracy, F1 score on prediction of yes answers, F1 score on prediction of no answers and macro-averaged F1 score for the yes/no questions, mean reciprocal rank (MRR) for the factoid questions, and mean F-measure for the list questions. Table 8 shows the results for exact answers for the last batch of Task 7b. These results are preliminary. The full results of phase B of Task 7b are available online (http://participants-area.bioasq.org/results/7b/phaseB/). The final results for Task 7b, phase B will be available after the manual assessment of the system responses.
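
Similarly, small sketches of two of the exact-answer measures, MRR for factoid questions and macro-averaged F1 for yes/no questions, computed on toy predictions:

```python
from sklearn.metrics import f1_score

# factoid: mean reciprocal rank of the first correct answer in each ranked candidate list
def mrr(candidate_lists, gold_answers):
    total = 0.0
    for candidates, gold in zip(candidate_lists, gold_answers):
        rank = next((i for i, c in enumerate(candidates, 1) if c in gold), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(candidate_lists)

print(mrr([["BRCA1", "TP53"], ["EGFR"]], [{"TP53"}, {"KRAS"}]))   # (1/2 + 0) / 2 = 0.25

# yes/no: macro-averaged F1 over the two classes
gold = ["yes", "yes", "no", "no"]
pred = ["yes", "no", "no", "no"]
print(f1_score(gold, pred, average="macro", labels=["yes", "no"]))
```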

The results presented in Figure 2 show that this year the performance of systems on the yes/no questions has clearly improved. In batch 5 for example, presented in Table 8, some systems outperformed the strong baseline based on previous versions of the OAQA system, with the top system achieving almost double the score of the baseline. Some improvement is also observed in the performance of the top systems for factoid and list questions in the preliminary results. However, there is still considerable room for improvement in these types of questions, as can be seen in Figure 2.

Figure 2: The performance achieved by systems in exact answer generation part of Task B, Phase B, across different years of the BioASQ challenge. For each test set the performance of the best performing system (Top) is presented based on the official evaluation measures. Since BioASQ6 the macro-averaged F1 score (macro F1) is the official measure for Yes/No questions, but accuracy (Acc), the former official measure, is also presented. The results for BioASQ7 are preliminary. The final results for Task 7b, phase B will be available after the manual assessment of the system responses.

5 Conclusions

In this paper, an overview of the seventh BioASQ challenge is presented. The challenge consisted of two tasks: semantic indexing and question answering. Overall, as in previous years, the best systems were able to outperform the strong baselines provided by the organizers. This suggests that advances over the state of the art were achieved through the BioASQ challenge, but also that the benchmark itself is challenging. Moreover, the shift towards systems that incorporate ideas based on deep learning models, observed in the previous year, is even clearer. Novel ideas have been tested and state-of-the-art deep learning methodologies have been adapted to biomedical question answering with great results. Specifically, the breakthroughs in various NLP tasks brought by new language models, such as BERT and GPT-2, gave birth to new approaches that significantly boost the performance of the systems. In the future, we expect novel methodologies, such as the newly proposed XLNet [DBLP:journals/corr/abs-1906-08237], to further cultivate research in the biomedical information systems field. Consequently, we believe that the challenge is successfully pushing the research frontier of this domain. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.

Acknowledgments

Google was a proud sponsor of the BioASQ Challenge in 2018. The seventh edition of BioASQ is also sponsored by Atypon Systems Inc. BioASQ is grateful to NLM for providing the baselines for task 7a and to the CMU team for providing the baselines for task 7b. Finally, we would also like to thank all teams for their participation.

References