
SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts

by   Nhung Thi-Hong Nguyen, et al.

Question answering (QA) systems have attracted explosive attention in recent years. However, there are few Vietnamese datasets for QA tasks, and almost none in the medical domain. We therefore built ViHealthQA, a Vietnamese Healthcare Question Answering dataset of 10,015 question-answer passage pairs, in which questions from health-interested users were asked on prestigious health websites and answered by highly qualified experts. This paper proposes a two-stage QA system based on Sentence-BERT (SBERT) with multiple negatives ranking (MNR) loss, combined with BM25. We then conduct diverse experiments with many bag-of-words models to assess our system's performance. The results show that our system outperforms these traditional methods.





1 Introduction

Today, many websites host QA forums where users can post questions and answer other users' questions. However, users usually have to wait for responses. Moreover, QA data has grown enormously, so new questions inevitably duplicate the meaning of questions already in the database. To reduce latency and effort, QA systems based on information retrieval (IR) that retrieve a good answer from an answer collection are essential. QA relies on open-domain datasets, such as texts on the web, or closed-domain datasets, such as collections of medical papers like PubMed [PubMedQA], to find relevant passages. Moreover, during the COVID-19 pandemic, people care more about their health, and the number of questions posted on health forums has increased rapidly. Therefore, QA in the medical domain plays an important role. Lexical gaps between queries and relevant documents, which occur when the two use different words to describe similar content, have been a significant issue; Table 1 shows a typical example of this issue in our dataset. Previous studies applied word embeddings to estimate the semantic similarity between texts and bridge this gap. Various studies have used deep neural networks and BERT to extract semantically meaningful text representations [laskar2020contextualized]. Notably, SBERT has recently achieved state-of-the-art performance on several tasks, including retrieval tasks [Henderson]. This paper focuses on exploring fine-tuned SBERT models with MNR loss.

Our contributions are threefold: (1) we introduce ViHealthQA, a dataset of 10,015 question-answer passage pairs in the medical domain; (2) we propose a two-stage QA system based on SBERT with MNR loss; and (3) we perform multiple experiments, including traditional models such as BM25, TF-IDF cosine similarity, and a language model, to compare against our system.

ID: 392
Question: Tôi bị dị ứng thuốc kháng sinh và dị ứng khi ăn thịt cua đồng. Trường hợp của tôi có được tiêm vaccine phòng Covid-19 không? (I am allergic to antibiotics and to eating field crab meat. Can I be vaccinated against Covid-19?)
Answer passage: Trường hợp của anh theo hướng dẫn của Bộ Y tế là thuộc đối tượng cần cẩn trọng khi tiêm vaccine Covid-19 và tiêm tại bệnh viện hoặc cơ sở y tế có đầy đủ năng lực cấp cứu ban đầu. (According to the guidance of the Ministry of Health, your case is one that requires caution when receiving the Covid-19 vaccine, and the injection should be given at a hospital or medical facility with full initial emergency capacity.)
Table 1: A typical example of lexical gaps in the ViHealthQA dataset.

2 Related work

In early-stage works on QA retrieval, several studies [Salton] presented sparse vector models. Using unigram word counts, these models map queries and documents to vectors with many zero values and rank similarity values to extract potential documents. In 2008, Manning et al. [Manning] conducted many experiments to gain a deeper understanding of the role of vectors, including how to compare queries with documents. Moreover, many researchers [Gery, Robertson] have paid attention to BM25 methods in IR tasks.

IR methods with sparse vectors have a significant drawback: the lexical gap challenge. The solution to this problem is using dense embeddings to represent queries and documents. This idea was proposed early with the LSI approach [Deerwester], but the most well-known model is BERT, which applies encoders to compute embeddings for queries and documents. Liu et al. [liu] added a final mean-pooling layer and then calculated similarity values between the outputs; Karpukhin et al. [karpukhin] instead used the initial CLS token. Many studies [laskar, Lee] applied BERT and reached significant results. Notably, SBERT [Reimers] uses Siamese and triplet network structures to produce semantically meaningful sentence embeddings, and multiple works have applied SBERT to Semantic Textual Similarity (STS) and Natural Language Inference (NLI) benchmarks. In 2021, Ha et al. [Ha] utilized SBERT to find similar questions in community question answering, experimenting with multiple losses, including MNR loss.

Because our task is in the medical domain, we reviewed some related corpora. For example, CliCR [CliCR] comprises around 100,000 gap-filling queries based on clinical case reports, and MedQA [Zhang] includes answers to real-world multiple-choice questions. In Vietnam, Nguyen et al. (2021) [van2020new] published ViNewsQA, which includes 22,057 human-generated question-answer pairs and supports machine reading comprehension tasks.

3 Task description

The database consists of question-answer passage pairs. Let Q = {q_1, ..., q_n} be the collection of questions and A = {a_1, ..., a_n} the collection of answer passages, where a_i is the answer passage of question q_i. Our task is to create models that, given a question q_i ∈ Q, retrieve the precise answer passage a_i ∈ A.

4 Dataset

4.1 Dataset characteristics

We release ViHealthQA, a novel Vietnamese dataset for question answering and information retrieval, including 10,015 question-answer passage pairs. We collect data from the Vinmec and VnExpress websites using the BeautifulSoup library. These are forums where users' health-related questions are answered by qualified doctors. The dataset consists of four features: index, question, answer passage, and link.

4.2 Overall statistics

After the data collection phase, we divide our dataset into train, dev, and test sets. In particular, there are 7,009 pairs in Train, 993 pairs in Dev, and 2,013 pairs in Test (Table 2).

According to Table 3, most answer passages are in the range of 101–300 words (34.10%), followed by 301–500 words (31.13%), 501–700 words (15.88%), and 701–1000 words (9.98%). Longer answer passages (over 1000 words) make up a small proportion (7.58%).

ViHealthQA | Value
Train | 7,009
Dev | 993
Test | 2,013
Average answer length | 495.33
Average question length | 103.87
Vocabulary (words) | 18,271
Average number of sentences | 3.95
Table 2: Statistics of the ViHealthQA dataset.

Length | Train | Val | Test | All
< 100 | 1.24 | 1.31 | 1.64 | 1.33
101–300 | 34.46 | 34.34 | 32.89 | 34.10
301–500 | 31.13 | 30.72 | 31.25 | 31.13
501–700 | 15.99 | 16.31 | 15.20 | 15.88
701–1000 | 9.80 | 8.66 | 11.23 | 9.98
> 1000 | 7.38 | 8.66 | 7.80 | 7.58
Table 3: Distribution of the answer passage length (%).

4.3 Vocabulary-based analysis

To understand the medical domain, we use the WordClouds tool to visualize the words that appear most frequently in the dataset (Figure 1). Table 4 shows the ten most frequent words, all related to the medical domain. Besides, users ask many questions about Coronavirus (COVID-19), children, inflammatory diseases, and allergies.

Table 4: Top 10 common words in the ViHealthQA dataset.
No. | Word | Freq. | English
1 | bác sĩ | 7790 | doctor
2 |  | 3409 | baby
3 | xét nghiệm | 3316 | test
4 | trẻ | 3012 | children
5 | triệu chứng | 2858 | symptom
6 | dị ứng | 2628 | allergic
7 | mũi | 2479 | nose
8 | da | 1979 | skin
9 | tiêm chủng | 1912 | vaccination
10 | gan | 1856 | liver

Figure 1: Word distribution of ViHealthQA.

5 SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers

In this paper, we propose a two-stage question answering system called SPBERTQA (Figure 2), consisting of a BM25-based sentence retriever and SBERT using PhoBERT fine-tuned with MNR loss. After training, the inputs (the question and the document collection) are fed into BM25-SPhoBERT. Then we rank the cosine similarity scores between the sentence-embedding outputs to extract the top K candidate documents.
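The second-stage ranking step, scoring candidates by cosine similarity between sentence embeddings and keeping the top K, can be sketched as follows. This is an illustrative numpy sketch with toy embeddings; `top_k_by_cosine` is a name introduced here, not from the paper.

```python
import numpy as np

def top_k_by_cosine(question_emb, passage_embs, k=5):
    """Rank answer passages by cosine similarity to the question embedding."""
    q = question_emb / np.linalg.norm(question_emb)
    P = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = P @ q                      # cosine similarity per passage
    order = np.argsort(-scores)[:k]    # indices of the top-k passages
    return order, scores[order]

# Toy embeddings: passage 1 points in nearly the same direction as the question.
q = np.array([1.0, 0.0, 1.0])
passages = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.1, 0.9],
                     [-1.0, 0.0, -1.0]])
idx, scores = top_k_by_cosine(q, passages, k=2)
print(idx[0])  # → 1
```

In the actual system the embeddings would come from the fine-tuned SBERT encoder rather than toy vectors.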

Figure 2: Overview of our system.

5.1 BM25 Based Sentence Retriever

We aim to train the model to focus on the meaningful knowledge in our dataset. Thus, we propose a sentence retriever stage that extracts, from every answer passage, the sentences most relevant to the corresponding question. Moreover, this stage helps overcome the maximum sequence length limit of pre-trained BERT models, which is 512 tokens in general (248 tokens for PhoBERT), while answer passages over 300 tokens account for about 65.47% of Train.

We use BM25 for the first stage because BM25 usually brings good results in IR systems [Robertson2]. Besides, most answer passages have fewer than four sentences (see the average number of sentences per answer passage in Table 2), so we choose the number of retained sentences accordingly.
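The sentence-retrieval stage can be sketched as below. This is a simplified illustration: the paper scores sentences with BM25, while this sketch plugs in a plain word-overlap scorer by default, and the regex sentence splitter and function names are assumptions.

```python
import re

def retrieve_sentences(question, passage, m=4, scorer=None):
    """First-stage retriever sketch: keep the m passage sentences most
    relevant to the question (the paper scores them with BM25; a simple
    word-overlap scorer stands in here)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    if scorer is None:
        q_words = set(question.lower().split())
        scorer = lambda sent: len(q_words & set(sent.lower().split()))
    ranked = sorted(sentences, key=scorer, reverse=True)
    kept = set(ranked[:m])
    # Preserve the original sentence order so the shortened passage stays readable.
    return " ".join(s for s in sentences if s in kept)

passage = ("Aspirin reduces fever. It can thin the blood. "
           "Always ask a doctor first. The weather was nice.")
print(retrieve_sentences("Does aspirin reduce fever?", passage, m=2))
```

The shortened passage, rather than the full one, is then what the SBERT stage encodes, keeping inputs within the model's token limit.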

5.2 SBERT using PhoBERT and fine-tuning with MNR loss

Multiple negatives ranking (MNR) loss: MNR loss works well for IR and semantic search [Henderson]. In every batch there are N positive pairs (q_i, a_i), where q_i is a question and a_i its positive answer passage, and for each question the answer passages of the other pairs serve as random negatives a_j (j ≠ i). With sim(q_i, a_j) denoting the cosine similarity between question and answer passage, the loss is given by Equation (1):

L = -(1/N) · Σ_{i=1}^{N} log( exp(sim(q_i, a_i)) / Σ_{j=1}^{N} exp(sim(q_i, a_j)) )    (1)

In the second stage, we use the pre-trained PhoBERT model. PhoBERT [Dat] is the first public large-scale monolingual language model for Vietnamese; its pre-training approach is based on RoBERTa, which optimizes BERT pre-training for more robust performance. We then fine-tune PhoBERT with MNR loss.
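The in-batch form of the MNR loss in Equation (1) can be sketched in numpy as below. This is a hedged illustration only; the actual fine-tuning uses an SBERT training framework, and the similarity scaling factor here is an assumption.

```python
import numpy as np

def mnr_loss(q_embs, a_embs, scale=20.0):
    """Multiple negatives ranking loss sketch: for each question, its paired
    answer is the positive and every other in-batch answer is a negative."""
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    a = a_embs / np.linalg.norm(a_embs, axis=1, keepdims=True)
    sims = scale * (q @ a.T)                 # scaled cosine similarity matrix
    # Cross-entropy with the diagonal (the true pairs) as target labels.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = mnr_loss(q, rng.normal(size=(4, 8)))
loss_matched = mnr_loss(q, q)  # identical embeddings: near-perfect ranking
print(loss_matched < loss_random)  # → True
```

Minimizing this loss pulls each question embedding toward its own answer passage and away from the other passages in the batch, which is what makes MNR well suited to retrieval.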

6 Experiments

6.1 Comparative methods

We compare our system with traditional methods (BM25, TFIDF-Cos, and LM), pre-trained PhoBERT, and fine-tuned SBERT variants (BM25-SXLMR and BM25-SmBERT).

6.1.1 BM25

BM25 is an optimized version of TF-IDF. Equation (2) gives the BM25 score of document D for query q:

score(D, q) = Σ_{t ∈ q} IDF(t) · f(t, D) · (k1 + 1) / ( f(t, D) + k1 · (1 − b + b · |D| / avgdl) )    (2)

where f(t, D) is the frequency of term t in D, |D| is the document length, and avgdl is the average document length. BM25 adds two parameters: k1 balances the contribution of term frequency, and b adjusts the importance of document-length normalization. In 2008, Manning et al. [Manning] suggested reasonable values of k1 between 1.2 and 2.0 and b = 0.75.
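A minimal BM25 scorer in the spirit of Equation (2) might look like this. This is illustrative only; the tokenization and the exact IDF variant (a common smoothed form) are assumptions.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one document (a token list) for a query over a small corpus."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    N = len(corpus)
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in corpus if t in d)     # document frequency of t
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed IDF (assumed variant)
        tf = doc_terms.count(t)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [["sốt", "cao", "ở", "trẻ"], ["dị", "ứng", "thuốc"], ["đau", "đầu"]]
q = ["dị", "ứng"]
scores = [bm25_score(q, d, corpus) for d in corpus]
print(scores.index(max(scores)))  # → 1, the document matching both query terms
```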

6.1.2 TF-IDF Cosine Similarity (TFIDF-Cos)

Cosine similarity is one of the most popular similarity measures in information retrieval applications and is superior to measures such as the Jaccard and Euclidean measures [Subhashini]. Given u and v as the respective TF-IDF bag-of-words vectors of a question and an answer passage, their similarity is computed by Equation (3) [Pathak]:

cos(u, v) = (u · v) / (‖u‖ · ‖v‖)    (3)
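Equation (3) over TF-IDF bag-of-words vectors can be sketched as follows, using a toy corpus; the function names and the plain tf·log(N/df) weighting are ours, not necessarily the paper's exact variant.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """TF-IDF weights for one token list against a small corpus (sketch)."""
    N = len(corpus)
    tf = Counter(tokens)
    vec = {}
    for t, f in tf.items():
        df = sum(1 for d in corpus if t in d)
        vec[t] = f * math.log(N / df) if df else 0.0
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors, Equation (3)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [["sốt", "cao"], ["dị", "ứng", "thuốc"], ["sốt", "trẻ"]]
q_vec = tfidf_vector(["sốt", "trẻ"], corpus)
sims = [cosine(q_vec, tfidf_vector(d, corpus)) for d in corpus]
print(sims.index(max(sims)))  # → 2, the document sharing both query words
```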

6.1.3 Language Model (LM)

LM is a probabilistic model of text [Tan]. Questions and answers are modeled as probability distributions over sequences of words. The original and basic method is unigram query likelihood (Equation (4)):

P(q | d) = Π_{t ∈ q} P(t | d)    (4)

P(q | d) is the probability of the query q under the language model derived from document d. To avoid zero scores when a query term does not occur in d, the unigram probabilities are smoothed with a background corpus C [Zhai]; the various smoothing methods differ in how they combine P(t | d) and P(t | C).

6.1.4 PhoBERT

We directly use PhoBERT to encode questions and answer passages. Then, we rank the top K answer passages by their cosine similarity scores with the corresponding question.

6.1.5 BM25-SXLMR

Similar to our model, but in the second stage, we use XLM-RoBERTa instead of PhoBERT. XLM-RoBERTa [Conneau] was pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages (including Vietnamese).

6.1.6 BM25-SmBERT

Similar to our model, but in the second stage, we use multilingual BERT, introduced by Pires et al. [pires2019multilingual]. This is a transformer model pre-trained on a large Wikipedia corpus covering 104 languages (including Vietnamese) using a masked language modeling (MLM) objective.

6.2 Data preprocessing

We pre-process the data by lowercasing and removing uninterpretable characters (e.g., newlines and extra whitespace). To tokenize the data, we employ the RDRSegmenter of VnCoreNLP [Vu]. Moreover, stop-words can become noise for traditional methods, which work well on pairs with high word matching between query and answer. Therefore, we conduct a stop-word removal phase: we first use TF-IDF to extract stop-words, and then remove these words from the data.
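The cleanup and TF-IDF-style stop-word extraction described above can be sketched as follows. This is a rough stand-in: the paper uses RDRSegmenter for tokenization, and its exact TF-IDF stop-word criterion is not given, so the document-frequency heuristic below is an assumption.

```python
import math

def clean(text):
    """Lowercase and collapse newlines / extra whitespace."""
    return " ".join(text.lower().split())

def idf_stopwords(docs, top_n=2):
    """Treat the words with the lowest IDF (i.e. appearing in the most
    documents) as stop-words -- a rough stand-in for a TF-IDF-based
    extraction step."""
    N = len(docs)
    vocab = {t for d in docs for t in d.split()}
    idf = {t: math.log(N / sum(1 for d in docs if t in d.split())) for t in vocab}
    return {t for t, _ in sorted(idf.items(), key=lambda kv: kv[1])[:top_n]}

docs = [clean("Bị sốt cao\n  thì làm gì"), clean("Trẻ bị dị ứng"), clean("Bị đau đầu")]
stops = idf_stopwords(docs, top_n=1)
print("bị" in stops)  # → True: "bị" appears in every document
```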

6.3 Experimental settings

We choose the xlm-roberta-base, bert-base-multilingual-cased, and vinai/phobert-base checkpoints. We then fine-tune SBERT for 15 epochs with a batch size of 32, a learning rate of , and a maximum sequence length of 256. Our experiments are performed on a single NVIDIA Tesla P100 GPU on the Google Colaboratory server.

6.4 Evaluation metric

P@K (Equation (5)) is the percentage of questions for which the exact answer passage appears among the K retrieved passages [van2020new]:

P@K = (100 / |Q|) · Σ_{i=1}^{|Q|} 1[a_i ∈ R_K(q_i)]    (5)

where Q is the collection of questions, A the collection of answer passages, a_i the exact answer passage of question q_i, and R_K(q_i) the K most relevant passages extracted for question q_i.

Besides, mean average precision (mAP) is used to evaluate the performance of models.
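Both metrics can be computed as follows; this is a sketch for the single-relevant-passage setting of this dataset, and the function names are ours.

```python
def precision_at_k(ranked_lists, gold, k):
    """P@K sketch: share of questions whose gold passage is in the top k."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return 100.0 * hits / len(gold)

def mean_average_precision(ranked_lists, gold):
    """mAP sketch for one relevant passage per question: average of
    1/rank of the gold passage (0 if it is not retrieved)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return 100.0 * total / len(gold)

ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
print(precision_at_k(ranked, gold, 1))       # gold is top-1 for one of three questions
print(mean_average_precision(ranked, gold))  # (1 + 1/2 + 1/3) / 3 * 100
```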

7 Results and Discussion

7.1 Results and discussion

As shown in Tables 5 and 6, our system achieves the best performance, with a 62.25% mAP score, a 50.92% P@1 score, and an 83.76% P@10 score on Test. BM25-SXLMR and BM25-SmBERT, which use multilingual BERT models, do not work better than our system with monolingual PhoBERT. Compared to the PhoBERT model without MNR fine-tuning, the models fine-tuned with MNR loss (BM25-SXLMR, BM25-SmBERT, and our system) obtain good results, which shows that using MNR loss to fine-tune models for this task is suitable.

Model | P@1 (Dev) | P@1 (Test) | P@10 (Dev) | P@10 (Test)
BM25 | 51.86 | 44.96 | 75.93 | 70.09
LM | 52.27 | 47.19 | 78.15 | 72.38
TFIDF-Cos | 47.63 | 39.54 | 75.13 | 70.39
PhoBERT | 8.36 | 6.95 | 31.72 | 23.10
BM25-SXLMR | 53.58 | 46.05 | 85.90 | 79.04
BM25-SmBERT | 49.85 | 44.91 | 81.97 | 75.71
Our system | 69.52 | 50.92 | 89.12 | 83.76
Table 6: Results on Dev and Test with P@1 and P@10 scores (%).
Model | Dev | Test
BM25 | 64.62 | 56.93
LM | 56.01 | 56.00
TFIDF-Cos | 57.12 | 50.31
PhoBERT | 16.08 | 12.45
BM25-SXLMR | 59.96 | 53.85
BM25-SmBERT | 60.77 | 55.52
Our system | 69.52 | 62.25
Table 5: Results on Dev and Test with mAP score (%).

7.2 Analysis

Figure 3: Results of lexical overlap experiments with P@1 (%).

To examine why our system is more robust than traditional methods, and why traditional methods suffer from the lexical gap issue, we run the models on pairs grouped by lexical overlap (the number of duplicate words between question and answer passage) from 0 to 10. As Figure 3 shows, at low overlap the bag-of-words methods cannot extract the precise answer, and with no overlap at all these models mostly do not work. In contrast, the fine-tuned models still achieve P@1 scores above 50% at low overlap, and above 80% once the overlap grows. Moreover, we provide typical examples from Dev predicted by BM25, LM, and our system (Table 7). ID 169 has word matching between question and answer passage, and BM25, LM, and our system all retrieve the precise answer. In contrast, in ID 776, no word of the question appears in the answer passage, so the models must understand the semantic background rather than capture high lexical overlap to retrieve the precise answer. BERT models capture context and meaning better than bag-of-words methods [Han]; in particular, SBERT derives semantically meaningful sentence embeddings [Reimers]. Therefore, our system based on sentence transformers finds the exact answer passage for the question with ID 776.

ID: 776
Question: Tai biến, chân tay tê bì điều trị như thế nào? (How are stroke and numbness in the hands and feet treated?)
Answer passage: Nếu vấn đề chính là rối loạn điện giải, nhiễm trùng huyết và gan thận, bạn nên đưa bố đến khám chuyên khoa Nội tiết hoặc Nội tổng quát. Về thần kinh, bác sĩ khám cần xem lại phim CT/MRI não để đánh giá lại tổn thương não mới có thể có được kế hoạch phòng ngừa đột quỵ tái phát, điều trị giảm đau thần kinh và phục hồi chức năng tối ưu. (If the main problems are electrolyte disturbances, sepsis, and liver-kidney issues, you should take your father to see an Endocrinology or General Internal Medicine specialist. Neurologically, the examining doctor needs to review the brain CT/MRI film to re-evaluate the brain damage in order to plan prevention of recurrent stroke, treatment of neuropathic pain, and optimal functional rehabilitation.)
Models: Our system

ID: 169
Question: Bệnh suy tủy xương vô căn có nguy hiểm không và điều trị thế nào? (Is idiopathic bone marrow failure dangerous, and how is it treated?)
Answer passage: Suy tủy xương vô căn tùy thuộc vào từng giai đoạn thì cách điều trị khác nhau. Nếu số lượng máu quá thấp thì phải điều trị ức chế miễn dịch hoặc ghép tủy. Có những bệnh nhân không đáp ứng với điều trị, tuy nhiên cũng có nhiều bệnh nhân chữa khỏi. (Treatment of idiopathic bone marrow failure differs depending on the stage. If the blood count is too low, immunosuppressive therapy or a bone marrow transplant is required. Some patients do not respond to treatment, but many patients are cured.)
Models: BM25, LM, and our system

Table 7: Examples in Dev predicted by traditional methods and our system.
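The lexical-overlap measure used in this analysis can be sketched as below (distinct whole-word matching after lowercasing; the paper's exact tokenization may differ).

```python
def lexical_overlap(question, answer):
    """Number of distinct words shared by a question and an answer passage."""
    return len(set(question.lower().split()) & set(answer.lower().split()))

q = "Bệnh suy tủy xương vô căn có nguy hiểm không?"
a = "Suy tủy xương vô căn tùy thuộc vào từng giai đoạn."
print(lexical_overlap(q, a))  # → 5 shared words: suy, tủy, xương, vô, căn
```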

8 Conclusion and Future Work

In this paper, we created the ViHealthQA dataset, which comprises 10,015 question-answer passage pairs in the medical domain. Every answer passage is a doctor's reply to the corresponding user's question, so the ViHealthQA dataset is suitable for real search engines. Secondly, we proposed SPBERTQA, a two-stage question answering system based on sentence transformers, on our dataset. Our proposed system performs better than bag-of-words-based models and fine-tuned multilingual pre-trained language models, and it mitigates the lexical gap problem.

In the future, we plan to employ a machine reading comprehension (MRC) module. This module extracts answer spans from answer passages so that users can grasp the meaning of the answer faster.


Acknowledgments

Luan Thanh Nguyen was funded by Vingroup JSC and supported by the Master Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VinBigdata), VINIF.2021.ThS.41.