In a clinical decision support system (CDSS), doctors and healthcare professionals require access to information from heterogeneous sources, such as research papers (Gorman et al., 1994; Schardt et al., 2007), electronic health records (Hanauer et al., 2015), clinical case reports (Fujiwara et al., 2018), reference works and knowledge base articles. Differential diagnosis is an important task where a doctor seeks to retrieve answers for non-factoid queries about diseases, such as “symptoms of IgA nephropathy” (see Figure 1). A relevant answer typically spans multiple sentences and is most likely embedded into the discourse of a long document (Yang et al., 2016b; Cohan et al., 2018).
Evidence-based medicine (EBM) has made efforts to structure physicians’ information needs into short structured question representations, such as PICO (patient, intervention, comparison, outcome) (Richardson et al., 1995) and, more generally, well-formed background / foreground questions (Cheng, 2004). We support this important query intention and define a query as a structured tuple of entity (e.g. a disease or health problem) and aspect. Our model focuses on clinical aspects such as therapy, diagnosis, etiology, prognosis, and others, which have previously been described in the literature by manual clustering of semantic question types (Huang et al., 2006) or by crawling medical Wikipedia section headings (Arnold et al., 2019). In a CDSS, a doctor can express these query terms with identifiers from a knowledge base or medical taxonomy, e.g. UMLS, ICD-10 or Wikidata. The system supports the user in assigning these links through search and auto-completion operators (Schneider et al., 2018; Fujiwara et al., 2018), which allows us to use these representations as input for the answer retrieval task.
Several methods have been proposed to apply deep neural networks for effective information retrieval (Guo et al., 2016; Mitra et al., 2017; Dai et al., 2018) and question answering (Seo et al., 2017; Wang et al., 2017), also with a focus on healthcare (Zhu et al., 2019; Jin et al., 2019). However, our CDSS scenario poses a unique combination of open challenges to a retrieval system:
Task coverage: Query intentions span a broad range in specificity and complexity (Huang et al., 2006; Nanni et al., 2017). For example, medical specialists may pose very precise queries that align with a pre-defined taxonomy and focus on rare diseases. On the other hand, nursing staff might have broader and more heterogeneous questions. However, in most cases we do not have access to task-specific training data, so training the model for a single intention is not feasible. We therefore require a generalized query representation that covers a broad range of intents and taxonomies, even with limited training data.
Domain adaptability: In many cases we do not even have textual data readily available at training time from all resources in a CDSS. However, we observe linguistic and semantic shifts between the heterogeneous types of text, e.g. different use of terms and abbreviations among groups of doctors. Therefore we face a zero-shot retrieval task that requires robust domain transfer abilities across diverse biomedical, clinical and healthcare text resources (Logeswaran et al., 2019).
Contextual coherence: Answers are often expressed as passages in context of a long document. Therefore the model needs to respect long-range dependencies such as the sequence of micro-topics that establish a coherent ‘train of thought’ in a document (Arora et al., 2016; Arnold et al., 2019). At the same time, the model is required to operate on a fine granularity (e.g., on sentence level) rather than on entire documents to be able to capture the boundaries of answers (Keikha et al., 2014).
Efficient neural information retrieval: Finally, all documents in the CDSS need to be accessible with fast ad-hoc queries by the users. Many question answering models are based on pairwise similarity, which is computationally too intensive when applied to large-scale retrieval tasks (Gillick et al., 2018). Instead, we require a continuous retrieval model that allows for offline indexing and approximate nearest neighbor search with high recall (Gillick et al., 2018), even for rare queries and with low latency in the order of milliseconds.
We approach these challenges and present Contextual Discourse Vectors (CDV), a neural document representation which is based on discourse modeling and fulfills the above requirements. Code and evaluation data are available at https://github.com/sebastianarnold/cdv. Our method is the first to address answer retrieval with structured queries on long heterogeneous documents from the healthcare domain.
CDV is based on hierarchical layers to encode word, sentence and document context with bidirectional Long Short-Term Memory (BLSTM). The model uses multi-task learning (Caruana, 1997) to align the sequence of sentences in a long document to the clinical knowledge encoded in pre-trained entity and aspect vector spaces. We use a dual encoder architecture (Gillick et al., 2018), which allows us to precompute discourse vectors for all documents and later answer ad-hoc queries over that corpus with low latency (Gillick et al., 2019). Consequently, the model predicts similarity scores with sentence granularity and does not require an extra inference step after the initial document indexing.
We apply our CDV model for retrieving passages from various public health resources on the Web, including NIH documents and Patient articles, with structured clinical query intentions of the form (entity, aspect). Because there is no training data available from most sources, we use a self-supervised approach to train a generalized model from medical Wikipedia texts. We apply this model to the texts in our evaluation in a zero-shot approach (Palatucci et al., 2009) without additional fine-tuning.
In summary, the major contributions of this paper include:
We propose a structured entity/aspect healthcare query model to support the essential query intentions of medical professionals. Our task is focused on the efficient retrieval of answer passages from long documents of heterogeneous health resources.
We introduce CDV, a contextualized document representation for passage retrieval. Our model leverages a dual encoder architecture with BLSTM layers and multi-task training to encode the position of discourse topics alongside the document. We use the representations to answer queries using nearest neighbor search on sentence level.
Our model utilizes generalized language models and aligns them with clinical knowledge from medical taxonomies, e.g. pre-trained entity and aspect embeddings. Therefore, it can be trained with sparse self-supervised training data, e.g. from Wikipedia texts, and is applicable to a broad range of texts.
We demonstrate the applicability of our CDV model with extensive experiments and a qualitative error analysis on nine heterogeneous healthcare resources. We provide additional entity/aspect labels for all datasets. Our model significantly outperforms existing document matching methods in the retrieval task and can adapt to different healthcare domains without fine-tuning.
In this paper, we first give an overview of related research (Section 2). Next, we introduce our query representation (Section 3). Then, we focus on the contextual document representation model (Section 4). Finally, we discuss the findings of our experimental evaluation of the healthcare retrieval task (Section 5) and summarize our conclusions (Section 6).
There is a large amount of work on question answering (QA) (Seo et al., 2017; Wang et al., 2017), also applied to healthcare (Abacha et al., 2019; Jin et al., 2019) which focuses primarily on factoid questions with short answers. Typically, these models are trained with labeled question-answer pairs. However, it was shown that these models are not suitable for extracting local aspects from long documents, and especially not for open-ended, long answer passages (Tellex et al., 2003; Keikha et al., 2014; Yang et al., 2016b; Zhu et al., 2019). We therefore frame our task as a passage retrieval problem, where the system’s goal is to extract a concise snippet (typically 5–20 sentences) out of a large number of long documents. Furthermore, following studies from EBM (Richardson et al., 1995; Cheng, 2004; Huang et al., 2006), we focus on structured healthcare queries instead of free-text questions.
Recently, new approaches have emerged that represent local information in the context of long documents. For example, Cohan et al. (2018) approach the problem as an abstractive summarization task. The authors use hierarchical encoders to model the discourse structure of a document and generate summaries using an attentive discourse-aware decoder. In our prior work on SECTOR (Arnold et al., 2019), we apply a segmentation and classification method to long documents to identify coherent passages and classify them into 27 clinical aspects. The model produces a continuous topic embedding on sentence level using BLSTMs, which has similar properties to the micro-topics described earlier by Arora et al. (2016) as a discourse vector (“what is being talked about”).
We follow these ideas as the groundwork for our approach. Our proposed model is based on a hierarchical architecture to encode a continuous discourse representation. To the best of our knowledge, our model is the first to use discourse-aware representations for answer retrieval. Additionally, we address the problem of sparse training data and propose a multi-task approach for training the model with self-supervised data instead of labeled examples.
A baseline approach to the passage retrieval problem is to split longer documents into individual passages and rank them independently according to their relevance for the query. Passage matching has been done using term-based methods (Robertson and Jones, 1976; Salton and Buckley, 1988), most prominently in TF-IDF (Jones, 1972) or Okapi BM25 (Robertson et al., 1995). However, these methods usually do not perform well on long passages or when there is minimal word overlap between passage and query. Therefore, most neural matching models tackle this vocabulary mismatch using semantic vector-space representations.
Representation-based matching models aim to match the continuous representations of queries and passages using a similarity function, e.g. cosine distance. This can be done on sentence level (ARC-I (Hu et al., 2014)), which does not work well if queries are short and passages are longer than a few sentences. Therefore, most approaches learn distinct query and passage representations using feed-forward networks (DSSM (Huang et al., 2013)) or convolutional neural networks (CNNs) (C-DSSM (Shen et al., 2014)).
Interaction-based matching models focus on the complex interaction between query and passage. These models use CNNs on sentence level (ARC-II (Hu et al., 2014)), match query terms and words using word count histograms (DRMM (Guo et al., 2016)), word-level dot product similarity (MatchPyramid (Pang et al., 2016)), attention-based neural networks (aNMM (Yang et al., 2016a)), kernel pooling (K-NRM (Xiong et al., 2017)) or convolutional n-gram kernel pooling (Conv-KNRM (Dai et al., 2018)). Finally, Zhu et al. (2019) utilize hierarchical attention on word and sentence level (HAR) to capture the interaction of the query with local context in long passages.
While interaction-based models can capture complex correlations between query and passage, these models do not include contextualized local information—e.g. long-range document context that comes before or after a passage—which might contain important information for the query. To overcome this problem, Mitra et al. (2017) combine document-level representations with interaction features in a deep CNN model (Duet). Wan et al. (2016) utilize BLSTMs (MVLSTM) to generate positional sentence representations across the entire document.
We combine the representation approach with interaction. Our proposed model is able to learn the interaction between the words of the passage and the discourse using a hierarchical architecture. At the same time, it encodes fixed sentence representations that we use to match query representations. Consequently, our model does not require pairwise inference between all query–sentence pairs, which is usually circumvented by re-ranking candidates (Gillick et al., 2018). Instead, our model requires only a single pass through all documents at index time. Furthermore, by encoding discourse-aware representations, the model is able to access long-range document context which is normally hidden after the passage split. We compare our approach to all the discussed matching models and review these properties again in Section 5.
3. Query Model
Our first challenge is to design a query model which can adapt to a broad number of healthcare answer retrieval tasks and utilizes the information sources available in a CDSS. In this section, we introduce a vector-space representation for this purpose.
We define a query as a structured tuple (entity, aspect). This approach of using two complementary query arguments originates from the idea of structured background/foreground questions in EBM (Cheng, 2004) and has been used before in many triple-based retrieval systems (Adolphs et al., 2011). In our healthcare scenario, we restrict entities to be of type disease, e.g. “IgA nephropathy”, and aspects to the clinical domain, e.g. “symptoms”, “treatment”, or “prognosis”. We discuss these two spaces in Sections 3.1 and 3.2 and propose their combination in Section 3.3. In general, our model is not limited to the query spaces used in this paper and is extendable to a larger number of arguments.
3.1. Entity Space
The first part of our problem is to represent the entity in focus of a query. In contrast to interaction-based models, which are applied to query–document pairs, our approach is to decouple entity encoding and document encoding. Therefore we follow recent work in representation-based Entity Linking (Gillick et al., 2019) and embed textual knowledge from the clinical domain into this representation. Our goal is to generalize entity representations, so the model will be able to align to existing taxonomies without retraining. Therefore, our entity space must be as complete as possible: it needs to cover each of the entities that appear in the discourse training data, but also rare entities that we expect at query time, e.g. in the application. We must further provide a robust method for predicting unseen entities (Logeswaran et al., 2019). In contrast to highly specialized entity embeddings constructed from knowledge graphs or multimodal data (Beam et al., 2018), our generic approach is based on textual data and allows us to apply the model to different knowledge bases and domains.
3.1.1. Entity Embeddings
Our goal is to create a mapping of each entity in the knowledge base, identified by its ID, into a low-dimensional entity vector space. (We use a single symbol as a placeholder for all embedding vector sizes, even if they are not equal.) We train an embedding by minimizing the loss for predicting the entity from the sentences in the entity descriptions:
Here, the model parameters are those required to approximate this probability. We optimize them using a bidirectional Long Short-Term Memory (BLSTM) network (Hochreiter and Schmidhuber, 1997) that predicts the entity ID from the words of a sentence. We encode the words using Fasttext embeddings (Bojanowski et al., 2017) and use Bloom filters (Serrà and Karatzoglou, 2017) to compress the entity IDs into a hashed bit encoding, allowing for fewer model parameters and faster training.
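The Bloom filter compression of entity labels can be illustrated with a minimal sketch. It assumes a 1024-bit output (the output layer size reported in the implementation details) and four hash functions derived from SHA-256; the number of hash functions, the hashing scheme and the function name `bloom_encode` are illustrative assumptions, not the configuration used for CDV.

```python
import hashlib
from typing import List


def bloom_encode(label: str, num_bits: int = 1024, num_hashes: int = 4) -> List[int]:
    """Map a label (e.g. a Wikidata ID) to a sparse bit vector.

    Each of the `num_hashes` seeded hashes selects one bit position, so at
    most `num_hashes` bits are set. Both defaults are assumptions.
    """
    bits = [0] * num_bits
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{label}".encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits


# "Q12136" is the Wikidata ID that appears in the Wikidata URL cited in the text.
code = bloom_encode("Q12136")
print(sum(code), "of", len(code), "bits set")
```

A network with a sigmoid output layer can then be trained to predict this bit vector instead of a one-hot vector over all entity IDs, which is what reduces the number of output parameters.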
Subsequently, we extend the approach of Palangi et al. (2016) and define the embedding function as the average of the hidden word states of the two LSTM directions at the first and the last time step, respectively:
Finally, we generate entity embeddings by applying the embedding function to all descriptions available. In the case of unseen entities, the embedding can be generated on the fly by applying the same function to their descriptions.
3.1.2. Training Data
We train the entity representation for diseases, syndromes and health problems using textual descriptions from various sources: Wikidata (https://www.wikidata.org/wiki/Q12136), UMLS (https://uts.nlm.nih.gov), GARD (https://rarediseases.info.nih.gov), Wikipedia abstracts, and the Diseases Database (http://www.diseasesdatabase.com). In total, the knowledge base contains over 27,000 entities identified by their Wikidata ID. We trained roughly 9,700 common entities with text from Wikipedia abstracts, while for rare entities we used only their names and short description texts.
3.2. Aspect Space
The second part of our problem is to represent the aspect in the query tuple. Here, we expect a wide range of clinical facets and we do not want to limit the users of our system to a specific terminology. Instead, we train a low-dimensional aspect vector space using the Fasttext skip-gram model (Bojanowski et al., 2017) on medical Wikipedia articles. This approach places words with similar semantics nearby in vector space and allows queries with morphologic variations using subword representations.
To find all possible aspects, we adopt prior work (Arnold et al., 2019) and collect all section headings from the medical Wikipedia articles. These headings typically consist of 1–3 words and describe the main topic of a section. We apply moderate preprocessing (lowercase, remove punctuation, split at “and” or “&”) and generate aspect embeddings using a BLSTM encoder with the same architecture as discussed above.
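The heading preprocessing can be sketched as follows. The sketch assumes that the split is applied at the word “and” and at the character “&” before punctuation is removed (otherwise the ampersand would be lost); the function name `heading_to_aspects` is hypothetical.

```python
import re
import string
from typing import List

# Translation table that deletes all ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def heading_to_aspects(heading: str) -> List[str]:
    """Lowercase a section heading, split it at 'and' / '&', strip punctuation."""
    parts = re.split(r"\band\b|&", heading.lower())
    aspects = []
    for part in parts:
        text = " ".join(part.translate(_PUNCT).split())
        if text:
            aspects.append(text)
    return aspects


print(heading_to_aspects("Signs and symptoms"))
print(heading_to_aspects("Causes & Risk Factors:"))
```

Each resulting aspect string would then be passed to the aspect encoder separately.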
We train the embedding with over 577K sentences from Wikipedia (see Table 1). We observe that there is a vocabulary mismatch in the headings so that potentially synonymous aspects are frequently labeled with different headings, e.g. “types” / “classification” or “signs” / “symptoms” / “presentation” / “characteristics”. However, it is also possible that in some contexts these aspects are hierarchically structured, e.g. “presentation” refers to the visible forms of a symptom. Our vector-space representation reflects these similarities, so it is possible to distinguish between these nuances at query time.
3.3. Query Representation
Finally, we represent the query as a tuple of entity and aspect embeddings, combined by vector concatenation.
This query encoder constitutes the upper part of our dual encoder architecture shown in Figure 2. In the next section, we introduce our document representation that completes the lower part.
4. Contextualized Discourse Vectors
In this section, we introduce Contextual Discourse Vectors (CDV), a distributed document representation that focuses on coherent encoding of local discourse in the context of the entire document. The architecture of our model is shown in Figure 2. We approach the challenges introduced in Section 1 by reading a document at word and sentence level (Section 4.1) and encoding sentence-wise representations using recurrent layers at document level (Section 4.2). We use the representations to measure similarity between every position in the document and the query (Section 4.3). Our model is trained to match the entity/aspect vector spaces introduced in Section 3 using self-supervision (Section 4.4).
4.1. Sentence Encoder
The first group of layers in our architecture encodes the plain text of an entire document into a sequence of vector representations. As we expect long documents—the average document length in our test sets is over 1,200 words—we chose to reduce the computational complexity by encoding the document discourse at sentence-level. It is important to avoid losing document context and word–discourse interactions (e.g. entity names or certain aspect-specific terms) during this step. Furthermore, our challenge of domain adaptability requires the sentence encoder to be robust to linguistic and semantic shifts from text sources that differ from the training data.
Therefore we start at the input layer by encoding all words in a document into fixed low-dimensional word vectors using pre-trained word embeddings with subword information (see below). Next, we encode all sentences into sentence representations based on the words in the sentence. This will reduce the number of computational time steps from words in a document to sentences. We compare two approaches for this sentence encoding step:
4.1.1. Compositional Sentence Embeddings
4.1.2. Pooling-based Sentence Embeddings
Since we want the model to be able to focus on individual words, we apply a language model encoder. We use the recent BioBERT (Lee et al., 2019), a transformer model which is pre-trained with a large amount of biomedical context on sub-word level. To generate sentence vectors from the input sequence, we use pooling of the attention layers per sentence:
Finally, we concatenate a positional encoding to the sentence embeddings, which encodes some rule-based structural flags such as begin/end-of-document, begin/end-of-paragraph, is-list-item. This encoding helps to guide the document encoder through the structure of a document.
4.2. Document Encoder
The second group of layers in our architecture encodes the sequence of sentences over the document. The objective of these layers is to transform the word/sentence input space into discourse vector space, which will later match with the query entity and aspect spaces, in the context of the document. To achieve contextual coherence, we use the entire document as input for a recurrent neural network, whose parameters we optimize at training time to minimize the loss over the sequence.
We adopt the architecture of SECTOR (Arnold et al., 2019) and use bidirectional LSTMs to read the document sentence by sentence. We use a final dense layer (a weight matrix and a bias) to produce the local discourse vectors for every sentence in the document.
The CDV matrix is our discourse-aware document representation, which embeds all information necessary to decode contextualized entity and aspect information for the document.
4.3. Passage Scoring
The center layer in our architecture addresses our main task objective: find the passages with highest similarity to the query. In Section 3, we described generalized vector spaces for entities and aspects that we use as query representation for high task coverage. We train our discourse vectors to share the same entity and aspect vector spaces. This enables us to run efficient neural information retrieval of multiple ad-hoc queries over the pre-computed CDV vectors later without having to re-run inference on the document encoder for each query. We store all vectors in an in-memory vector index that allows us to efficiently retrieve approximate nearest neighbors using cosine distance. Figure 3 shows the overall process of training, indexing and ad-hoc answer retrieval. Because we reuse entity and aspect embeddings for training, our document model ‘inherits’ the properties from these spaces, e.g. robustness for unseen and rare entities or aspects.
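The index-then-query workflow can be illustrated with a minimal sketch. It assumes that each indexed sentence vector is the concatenation of a predicted entity vector and a predicted aspect vector, mirroring the query representation, and it uses exhaustive cosine search in place of an approximate nearest neighbor index. The function names and the toy two-dimensional spaces are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
IndexRow = Tuple[str, int, Vector]  # (document ID, sentence number, vector)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_index(doc_vectors: Dict[str, List[Vector]]) -> List[IndexRow]:
    """Flatten precomputed per-sentence vectors into one searchable list."""
    return [(doc_id, i, vec)
            for doc_id, vecs in doc_vectors.items()
            for i, vec in enumerate(vecs)]


def search(index: List[IndexRow], entity_vec: Vector, aspect_vec: Vector,
           top_k: int = 3) -> List[Tuple[float, str, int]]:
    """Score every indexed sentence against the concatenated query vector."""
    query = entity_vec + aspect_vec  # list concatenation
    scored = [(cosine(query, vec), doc_id, i) for doc_id, i, vec in index]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


# Toy index: 2-d entity space followed by 2-d aspect space.
index = build_index({
    "doc_a": [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]],
    "doc_b": [[0.0, 1.0, 1.0, 0.0]],
})
hits = search(index, entity_vec=[1.0, 0.0], aspect_vec=[0.0, 1.0], top_k=2)
print(hits)
```

The document encoder is run once per document to fill the index; each later query only needs the vector comparison.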
4.3.1. Discourse Decoder
To decode the individual entity and aspect predictions from the discourse vectors, we utilize two learned decoder matrices with bias terms. We optimize these parameters using a multi-task objective with shared weights (Caruana, 1997) to minimize the distance to the training labels.
4.3.2. Sentence Scoring
4.3.3. Answer Retrieval
The scoring operation yields a sentence-level histogram which describes the similarity between the query and every sentence in a document. At this point, we have the opportunity to select a coherent set of sentences as answers, similar to Arnold et al. (2019). However, because all healthcare datasets that we use for evaluation in this paper already provide passage boundaries, we leave this step for future work. Instead, we use the average sentence score per passage for answer retrieval.
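Averaging sentence scores over given passage boundaries can be sketched as follows. The passage names, the half-open sentence offsets and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def passage_scores(sentence_scores: List[float],
                   passages: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Average the sentence scores inside each passage.

    `passages` maps a passage ID to half-open sentence offsets [start, end).
    """
    result = {}
    for passage_id, (start, end) in passages.items():
        span = sentence_scores[start:end]
        result[passage_id] = sum(span) / len(span) if span else 0.0
    return result


# Toy document with six sentences and three passages.
scores = [0.25, 0.75, 1.0, 0.5, 0.0, 0.25]
bounds = {"intro": (0, 1), "symptoms": (1, 4), "treatment": (4, 6)}
ranked = sorted(passage_scores(scores, bounds).items(),
                key=lambda item: item[1], reverse=True)
print(ranked)
```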
Figure 4 shows the scoring curves divided into entity, aspect and average scores. It is clearly visible that the model coherently predicts long-range dependencies for the entity “IgA nephropathy” over the entire document. The aspect similarity with “symptoms” is much more focused on single sentences.
4.4. Self-supervised Training
We train a generalized CDV model for all evaluation tasks by jointly optimizing all model parameters from the sentence encoder, document encoder and passage scoring layers on a training set. For this task, we use the textual data about diseases and health problems available from Wikipedia. This process is self-supervised, because there exist no labeled query-answer pairs for these documents. Instead, we assign to each sentence a set of related entities and aspects using simple heuristics.
We collected over 8,600 articles for training and removed all instances contained in any of the test sets. The collection contains information about over 8K entities and 15K aspects (see Table 2).
4.4.1. Discourse Encoding
We create the target objectives for training using the average of the label embeddings contained in the training entities and aspects on sentence level:
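As an illustration, the averaging of label embeddings into a sentence-level training target can be sketched as follows. The lookup table, its two-dimensional vectors and the function name `mean_embedding` are hypothetical.

```python
from typing import Dict, Iterable, List


def mean_embedding(labels: Iterable[str],
                   table: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of all labels assigned to one sentence."""
    vectors = [table[label] for label in labels if label in table]
    if not vectors:
        raise ValueError("no known labels for this sentence")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy aspect space: a sentence labeled with two aspects gets their mean.
aspect_table = {"signs": [1.0, 0.0], "symptoms": [0.0, 1.0]}
target = mean_embedding(["signs", "symptoms"], aspect_table)
print(target)
```

The same function would be applied once with the entity table and once with the aspect table to obtain the two targets of the multi-task objective.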
4.4.2. Optimized Loss Function
We observe a strong imbalance of entity and aspect labels over the course of a single document, for example when passages contain lists (very short sentences), rare entities or have uncommon headlines. To give the network the ability to capture these anomalies, especially with larger batch sizes, we use a robust loss function (Barron, 2019) which resembles a smoothed form of Huber loss (Huber, 1992):
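Barron (2019) defines a general robust loss with a shape parameter α and a scale parameter c. The sketch below evaluates it for a scalar residual and assumes α = 1 and c = 1, for which the loss reduces to the smooth pseudo-Huber form sqrt((x/c)² + 1) − 1. These parameter values are an illustrative assumption and not necessarily the configuration used for CDV.

```python
import math


def robust_loss(x: float, alpha: float = 1.0, c: float = 1.0) -> float:
    """General robust loss of Barron (2019) for a scalar residual x.

    The limit cases alpha = 0 and alpha = 2 need separate formulas and are
    not covered by this sketch.
    """
    if alpha in (0.0, 2.0):
        raise ValueError("alpha = 0 and alpha = 2 are limit cases not handled here")
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)


for residual in (0.01, 1.0, 10.0):
    print(residual, robust_loss(residual), 0.5 * residual ** 2)
```

For small residuals the value is close to the squared error 0.5·x², while for large residuals it grows only linearly, which limits the influence of outlier sentences.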
In the next section, we apply our CDV model to a healthcare answer retrieval task.
5. Experimental Results
We evaluate our CDV model and 14 baseline methods in an answer passage retrieval task. All models are trained using self-supervision on Wikipedia texts and applied as a zero-shot task (Palatucci et al., 2009) (i.e. without further fine-tuning) to three diverse English healthcare datasets: WikiSection, MedQuAD and HealthQA.
5.1. Evaluation Setup
As queries, we use tuples of the form (entity, aspect). Because our task requires retrieving the answers from over 4,000 passages and the interaction-based models in our comparison require computationally expensive pairwise inference, we evaluate all numbers on a re-ranking task (Gillick et al., 2018). We follow the setup of Logeswaran et al. (2019) and use BM25 (Robertson et al., 1995) to provide each model with a pre-filtered set of 64 potentially relevant passage candidates. This choice covers 80–91% of all true answers (depending on the dataset) as a trade-off between task complexity and real-world applicability. The numbers reported for HealthQA in the original paper were evaluated by re-ranking ten candidates (one relevant, three partially relevant and six irrelevant) and are therefore not comparable. To facilitate full recall in this model comparison, we add missing true answers to the candidates if necessary by overwriting the lowest-ranked false answers in the list and shuffle afterwards. We rank the candidate answers using exhaustive nearest neighbor search and leave the evaluation of indexing efficiency for future work. Next, we describe the datasets, metrics and methods used in our experiments.
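The construction of the candidate lists can be sketched as follows. The sketch assumes that a first-stage ranking of passage IDs (e.g. from BM25) is already available; the function name, the fixed random seed and the toy IDs are hypothetical.

```python
import random
from typing import List, Set


def build_candidates(ranked_ids: List[str], true_ids: Set[str],
                     size: int = 64, seed: int = 0) -> List[str]:
    """Keep the top `size` passages, force all true answers in, then shuffle.

    Missing true answers overwrite the lowest-ranked false answers.
    """
    candidates = list(ranked_ids[:size])
    missing = sorted(true_ids - set(candidates))
    position = len(candidates) - 1
    for passage_id in missing:
        # Skip slots that already hold a true answer.
        while position >= 0 and candidates[position] in true_ids:
            position -= 1
        if position < 0:
            break  # no false answer left to overwrite
        candidates[position] = passage_id
        position -= 1
    random.Random(seed).shuffle(candidates)
    return candidates


ranking = [f"p{i}" for i in range(100)]   # toy first-stage ranking
answers = {"p3", "p90"}                   # p90 is outside the top 64
cands = build_candidates(ranking, answers)
print(len(cands), "p90" in cands, "p63" in cands)
```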
| | Wikipedia (training) | WS | MQ | HQ |
|---|---|---|---|---|
| # documents | 8,605 | 716 | 1,111 | 178 |
| avg words/doc | 977.6 | 1,396.7 | 811.1 | 1,449.4 |
| Model (all trained on Wikipedia) | WS R@1 | WS R@10 | WS MAP | MQ R@1 | MQ R@10 | MQ MAP | HQ R@1 | HQ R@10 | HQ MAP |
|---|---|---|---|---|---|---|---|---|---|
| TF-IDF (Jones, 1972) | 17.10 | 64.99 | 31.77 | 23.83 | 82.84 | 42.66 | 17.46 | 71.54 | 34.47 |
| BM25 (Robertson et al., 1995) | 23.87 | 71.26 | 38.89 | 29.48 | 86.11 | 48.89 | 22.55 | 73.27 | 38.45 |
| ARC-I (Hu et al., 2014) | 1.61 | 13.69 | 6.90 | 1.98 | 19.22 | 8.47 | 1.38 | 13.87 | 6.87 |
| DSSM (Huang et al., 2013) | 22.82 | 74.31 | 39.02 | 13.11 | 55.92 | 27.38 | 10.50 | 46.44 | 22.04 |
| C-DSSM (Shen et al., 2014) | 9.59 | 53.12 | 22.82 | 9.67 | 47.54 | 22.12 | 10.56 | 58.30 | 25.37 |
| ARC-II (Hu et al., 2014) | 10.38 | 53.62 | 23.61 | 9.19 | 47.58 | 21.66 | 11.26 | 58.85 | 26.09 |
| DRMM (Guo et al., 2016) | 24.96 | 67.56 | 39.24 | 34.52 | 82.35 | 51.51 | 21.80 | 80.24 | 40.03 |
| MatchPyramid (Pang et al., 2016) | 18.53 | 64.21 | 33.12 | 25.14 | 72.33 | 41.37 | 19.24 | 73.79 | 37.22 |
| aNMM (Yang et al., 2016a) | 4.77 | 32.17 | 14.03 | 7.15 | 37.18 | 17.08 | 3.74 | 27.20 | 12.07 |
| KNRM (Xiong et al., 2017) | 16.96 | 61.03 | 31.04 | 16.86 | 61.35 | 31.35 | 22.94 | 67.92 | 37.65 |
| CONV-KNRM (Dai et al., 2018) | 34.36 | 77.25 | 48.72 | 42.70 | 84.54 | 57.57 | 33.13 | 85.41 | 50.55 |
| HAR (Zhu et al., 2019) | 45.31 | 84.15 | 58.38 | 55.65 | 93.17 | 69.10 | 43.20 | 88.34 | 58.80 |
| Duet (Mitra et al., 2017) | 18.34 | 59.13 | 31.74 | 20.50 | 65.91 | 35.28 | 17.27 | 64.81 | 32.13 |
| MVLSTM (Wan et al., 2016) | 30.74 | 76.10 | 45.58 | 36.86 | 86.29 | 53.18 | 26.78 | 84.42 | 45.37 |
5.1.1. Evaluation Datasets
We conduct experiments on three English datasets from the clinical and healthcare domain (see Table 2). From the documents provided, we use the plain text of the entire document body during model inference and the segmentation information for generating the passage candidates. From all queries provided, we use the entity labels (mention text, Wikidata ID) and aspect labels (UMLS canonical name). If entity and aspect identifiers were not provided by the dataset, we added them manually by asking three annotators from clinical healthcare to label them.
WikiSectionQA (Arnold et al., 2019) (WS) is a large subset of full-text Wikipedia articles about diseases, labeled with entity identifiers, section headlines and 27 normalized aspect classes. We extended this dataset for answer retrieval by constructing query tuples from every section containing the given entity ID and normalized aspect label. We included abstracts as “information”, but skipped sections labeled as “other”. We use the en_disease-test split for evaluation and made sure that none of the documents are contained in our training data.
MedQuAD (Abacha and Demner-Fushman, 2019) (MQ) is a collection of medical question-answer pairs from multiple trusted sources of the National Institutes of Health (NIH): National Cancer Institute (NCI, https://www.cancer.gov), Genetic and Rare Diseases (GARD, https://rarediseases.info.nih.gov), Genetics Home Reference (GHR, https://ghr.nlm.nih.gov), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK, http://www.niddk.nih.gov/health-information/health-topics/), National Institute of Neurological Disorders and Stroke (NINDS, http://www.ninds.nih.gov/disorders/), NIH Senior Health (http://nihseniorhealth.gov/) and National Heart, Lung and Blood Institute (NHLBI, http://www.nhlbi.nih.gov/health/). We left out documents from Medline Plus due to property rights. Questions are annotated with structured identifiers for entities (UMLS CUI) and aspects (semantic question type), and contain a long passage as answer. To make this dataset applicable to our method, we reconstructed the entire documents from the answer passages and kept only questions about diseases for evaluation. We filtered out documents with only one passage (these were always labeled “information”) and separated a random 25% test split from the remaining documents.
HealthQA (Zhu et al., 2019) (HQ) is a collection of consumer health question-answer pairs crawled from the website Patient (https://patient.info). The answer passages were generated from sections in the documents and annotated by human labelers with natural language questions. We reconstructed the full documents from these sections. Additionally, our annotators added structured entity and aspect labels to all questions in the test split. Although some questions are not about diseases, we kept all of them to remain comparable with related work.
5.1.2. Evaluation Metrics
For all ranking experiments, we use Recall at top K (R@K) and Mean Average Precision (MAP) metrics. While R@1 measures if the top-1 answer is correct or not (similar to a question answering task), we also report R@10, which corresponds with the ability to retrieve all correct answers in a top-10 results list, and MAP, which considers the entire result list.
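The two metrics can be sketched as follows. The sketch assumes that R@K is the fraction of a query's relevant passages found in the top K results (for a query with a single relevant passage, R@1 is then 1 or 0) and that MAP is the mean over queries of the average precision of the full ranked list. The function names and toy rankings are hypothetical.

```python
from typing import List, Set, Tuple


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average precision, averaged over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [(["a", "b", "c", "d"], {"b", "d"}), (["x", "y"], {"x"})]
print(recall_at_k(*runs[0], k=2), mean_average_precision(runs))
```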
5.1.3. Baseline Methods
We evaluate two term-based matching functions as baselines: TF-IDF (Jones, 1972) and BM25 (Robertson et al., 1995). We used the implementation in Apache Lucene 8.2.0 (https://lucene.apache.org) to retrieve passages containing the entity and aspect of a query, e.g. “IgA nephropathy symptoms”, from the index of all passages in the test dataset.
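For illustration, term-based passage scoring with BM25 can be sketched as follows. This is a textbook Okapi BM25 with whitespace tokenization and the common parameter values k1 = 1.2 and b = 0.75, not the Lucene implementation used in the experiments; the toy passages are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, passages: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each passage against the query with textbook Okapi BM25."""
    docs = [p.lower().split() for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many passages does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1.0 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


passages = [
    "iga nephropathy symptoms include blood in the urine",
    "treatment of iga nephropathy relies on blood pressure control",
    "asthma symptoms include wheezing and coughing",
]
scores = bm25_scores("IgA nephropathy symptoms", passages)
print(scores)
```

Only the first passage contains all three query terms, so it receives the highest score; a passage that shares no word with the query scores zero, which is the vocabulary mismatch that the neural models aim to overcome.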
Additionally, we evaluate the following document matching methods from the literature: ARC-I and ARC-II (Hu et al., 2014), DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), DRMM (Guo et al., 2016), MatchPyramid (Pang et al., 2016), aNMM (Yang et al., 2016a), Duet (Mitra et al., 2017), MVLSTM (Wan et al., 2016), KNRM (Xiong et al., 2017), CONV-KNRM (Dai et al., 2018) and HAR (Zhu et al., 2019). For implementing these models, we followed Zhu et al. (2019) and used the open source implementation MatchZoo (Guo et al., 2019) with pre-trained glove.840B.300d vectors (Pennington et al., 2014). All models were trained with our self-supervised Wikipedia training set using queries containing the entity and lowercase heading, e.g. “IgA nephropathy ; symptoms” and applied to the test sets using queries of the same structure, instead of natural language questions.
5.2. Implementation Details
We implement our models with the following configurations: For the sentence encoding, we use either glove.6B.300d pre-trained GloVe vectors (+avg-glove), 128d fine-tuned Fasttext embeddings (+avg-fasttext) or the 768d pre-trained BioBERT (Lee et al., 2019) language model (+pool-biobert). For the document encoding, we use two LSTM layers (one forward, one backward) with 512 dimensions each, a discourse vector dense layer with 256 dimensions, L2 batch normalization and tanh activation. The discourse decoder is a 128-dimensional output layer with tanh activation and Huber loss. The network is trained with stochastic gradient descent over 50 epochs using the ADAM optimizer with a batch size of 16 documents, a learning rate of, exponential decay per epoch, dropout and weight decay regularization. We chose these parameters using hyperparameter search on the WikiSection validation set. During training, we restrict the maximum document length to 396 sentences and maximum sentence length to 96 tokens, due to memory constraints on the GPU.
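The length restriction applied during training can be sketched as follows. We assume here that over-long documents and sentences are truncated rather than discarded, which the description above does not specify; the function name and the token-list input format are likewise illustrative.

```python
from typing import List

MAX_SENTENCES = 396  # maximum document length in sentences, as stated above
MAX_TOKENS = 96      # maximum sentence length in tokens, as stated above


def truncate_document(
    doc: List[List[str]],
    max_sentences: int = MAX_SENTENCES,
    max_tokens: int = MAX_TOKENS,
) -> List[List[str]]:
    """Cut a tokenized document to the maximum sentence and token counts.

    `doc` is a list of sentences, each a list of tokens. Truncation (as
    opposed to dropping the document) is an assumption of this sketch.
    """
    return [sentence[:max_tokens] for sentence in doc[:max_sentences]]


# Toy document with 400 sentences of 100 tokens each.
long_doc = [["tok"] * 100 for _ in range(400)]
short_doc = truncate_document(long_doc)
```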
The entity and aspect embeddings are trained with 128d Fasttext embeddings, followed by BLSTM and dense embedding layers, each with 128 dimensions and tanh activations. The output layer is configured with 1024 dimensions, sigmoid activation and BPMLL loss (Zhang and Zhou, 2006) to predict the Bloom filter hash () of the entity or aspect. The network is trained similarly to the CDV model, but we use 5 epochs with a batch size of 128 sentences, a learning rate of and dropout.
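To illustrate the kind of Bloom filter target that the 1024-dimensional output layer predicts, the sketch below hashes an identifier into a sparse binary vector. The number of hash functions (`num_hashes=5`), the SHA-256-based hashing scheme and the example identifier are assumptions chosen for illustration and do not reflect the exact configuration of our model.

```python
import hashlib
from typing import List


def bloom_encode(identifier: str, num_bits: int = 1024, num_hashes: int = 5) -> List[int]:
    """Encode an identifier as a binary Bloom filter vector.

    Each of the `num_hashes` seeded hash functions sets one bit. Different
    hashes may collide on the same bit, so between 1 and `num_hashes` bits
    are set in total.
    """
    bits = [0] * num_bits
    for seed in range(num_hashes):
        # Derive a distinct hash function per seed by prefixing the input.
        digest = hashlib.sha256(f"{seed}:{identifier}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bits[index] = 1
    return bits


# Hypothetical identifier string for an entity.
target = bloom_encode("entity:IgA nephropathy")
```

Such a target is compatible with a sigmoid output layer, because every dimension is an independent binary label.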
Table 3 shows the results on the answer passage retrieval task using CDV and document matching models on three healthcare datasets. We observe that CDV consistently achieves significantly better results than all term-based, representation-based and combined models across all datasets. In comparison with pairwise interaction-based models, our representation-based retrieval model outperforms all tested models on average, scores best on WikiSection and HealthQA and second best on the MedQuAD dataset. Retrieval time per query is 247ms (43ms) on average. Figure 5 further shows that we correctly match between 67.5% and 91.4% of entities in the datasets and resolve 49.4% to 66.3% of all aspects.
Comparison of Model Architectures
Our query model is able to match most of the questions in the entity/aspect scheme (see Section 5.4 for exceptions). The results show that term-based TF-IDF and BM25 models can solve the healthcare retrieval task sufficiently with R@10 . In contrast, none of the representation-based re-ranking models can achieve similar performance, except DSSM on WikiSectionQA. Most of the recent interaction-based and combined models outperform BM25 and have significant advantages on the MedQuAD dataset, which contains a large amount of generated information that can be matched exactly. We conclude that simple word-level interactions are important for this task and that representation-based models trade off this property for fast retrieval times.
Our CDV model performs well on all datasets, but shows a significant advantage on the Wikipedia-based WikiSectionQA dataset. Although all models are trained on the same data, the only model with similar behavior is DSSM. One possible reason is that entity embeddings are an important source for background information, and these are mainly based on Wikipedia descriptions. 99.9% of the WikiSectionQA entities are covered in our embedding, 97.1% on MedQuAD and only 69.29% on HealthQA, because HealthQA does not contain only diseases. Additionally, sentence embeddings provide different levels of background knowledge and language understanding. The pre-trained GloVe embedding can handle the task well, but is outperformed by our fine-tuned Fasttext embedding and the large BioBERT language model.
Domain Adaptability and Task Coverage
Figure 6 shows the performance of our CDV+avg-fasttext model across all data sources, most of them contained in MedQuAD. This distribution reveals that our model's top-1 accuracy is stable in the adaptation to most sources except National Cancer Institute (CancerGov). However, we notice that R@10 performance is high among all sources except SeniorHealth. Figure 7 shows that R@10 performance across the most frequent aspects is over 93% in most cases, but with varying top-1 recall. We will address these errors in Section 5.4.
Impact of Contextual Dependencies
An important feature of our CDV model is that all score predictions are calculated at the sentence level with respect to long-range context across the entire document. In Figure 4, we observe that the model is able to predict the entity (top curve) consistently over the document, although there are many coreferences in Wikipedia text. The aspect curve (center) clearly shows the beginning of the expected section “Symptoms” and the model is uncertain for the following sentences. Finally, the average score (bottom curve) shows a coherent prediction.
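The combination of the two curves into an average score can be sketched as a per-sentence mean of the entity and aspect scores. The score values below are invented toy numbers, not model outputs, and the helper name is hypothetical.

```python
from typing import List


def average_scores(entity_scores: List[float], aspect_scores: List[float]) -> List[float]:
    """Average the entity and aspect score of every sentence in a document."""
    if len(entity_scores) != len(aspect_scores):
        raise ValueError("both score lists must cover the same sentences")
    return [(e + a) / 2 for e, a in zip(entity_scores, aspect_scores)]


# Toy scores for a four-sentence document: the entity is matched throughout,
# while the aspect only matches from the third sentence onward.
entity = [0.9, 0.8, 0.9, 0.85]
aspect = [0.1, 0.2, 0.9, 0.6]
combined = average_scores(entity, aspect)
# Index of the sentence with the highest combined score.
best_sentence = max(range(len(combined)), key=combined.__getitem__)
```

In this toy example the combined score peaks at the sentence where the aspect begins to match, which mirrors the behavior of the bottom curve.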
5.4. Discussion and Insights
We perform an error analysis on the predictions of the CDV+avg-fasttext model to identify main reasons for answer misranking. For this purpose we analyse samples in which the model ranks a wrong passage at the top-1 position. We look at 50 random mismatched samples per dataset to understand the individual challenges per source. We discuss the main findings in the following.
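The sampling step of this analysis can be sketched as follows: collect all queries whose top-1 passage is not a correct answer and draw a fixed-size random sample from them. The data layout (query id mapped to the top-1 passage id and the set of correct passage ids), the function name and the seed are assumptions for illustration.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_top1_errors(
    predictions: Dict[str, Tuple[str, Set[str]]],
    sample_size: int = 50,
    seed: int = 0,
) -> List[str]:
    """Return up to `sample_size` random query ids with a wrong top-1 passage.

    `predictions` maps a query id to (top-1 passage id, set of correct ids).
    """
    errors = [
        qid for qid, (top1, gold) in sorted(predictions.items())
        if top1 not in gold
    ]
    if len(errors) <= sample_size:
        return errors
    return random.Random(seed).sample(errors, sample_size)


# Toy predictions: every odd-numbered query has a wrong top-1 passage.
predictions = {
    f"q{i}": (("wrong" if i % 2 else "gold"), {"gold"}) for i in range(200)
}
sampled = sample_top1_errors(predictions)
```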
Figure 8 shows that a main source of entity errors comes from selecting passages that belong to related entities. This includes entities that are superclasses or subclasses of the ground truth, e.g. selecting a passage covering “Diarrhea” when “Chronic Diarrhea in Children” is the query entity. These errors are most significant in WikiSectionQA and MedQuAD, because HealthQA covers mostly common diseases. We especially observe this in samples from Genetics Home Reference and National Cancer Institute. Figure 6 shows that R@1 is low for samples from these sources, whereas their R@10 is high. That is because genetic conditions and cancer types inherently contain entities with very similar names and descriptions. For instance, we see “Spastic Paraplegia Type 8” falsely resolved to “Spastic Paraplegia Type 11”. As the representations are close to each other in vector space, the correct samples are almost always found within the top-10 ranked candidates, which is consistent with the high R@10.
Likewise, we observe that in HealthQA 34% and in WikiSectionQA 16% of aspects are mismatched to related aspects. Figure 7 shows the distribution of aspects and the model’s ability to resolve them. It is salient that some aspects are especially difficult to resolve. Aside from the fact that these aspects are in the long tail, a further analysis reveals that they are often resolved to related aspects. For example, passages covering “classification” are often very similar and therefore confused with passages about “diagnosis” and “symptoms”. The same holds true for “prognosis” and “management”. Queries asking for “prevalence” of a disease are often resolved to “information” passages, because disease frequency is often mentioned in these introductory texts. In general, passages about related aspects often share similar tokens and document context, which makes their distinction more difficult.
25% of queries from HealthQA contain entities that are not diseases but procedures, drugs or other entity types. As our model is trained on textual data covering diseases only, we do not expect it to fully resolve these entities. However, we observe that the model is capable of finding the correct passage for 23% of unseen entities. This shows that while our model is not trained on such entity types, the fallback embedding described in Section 3.1.1 still allows the model to generalize even to non-diseases in these cases.
Evaluation vs. Real-World Application
We further identified a number of errors related to the structure of the evaluation that would be less problematic or even beneficial in a real-world application. The model frequently ranks passages at the top which answer the query but have a differing aspect assigned. We observe this in 24% of analysed samples from WikiSectionQA and 18% from HealthQA. This often seems to be caused by the non-discrete nature of topical aspects. In practice a passage can cover more than one aspect, but our evaluation does not currently capture this ambiguity. Additionally, we find some mismatches between passages and their ground-truth aspects, which can be ascribed to writing errors in WikiSectionQA and labeling errors in HealthQA. Aspects in MedQuAD are less ambiguous in general and only 4% fall into this error class. Figure 5 shows that the model therefore resolves more aspects correctly for MedQuAD queries.
Irrelevant Text within Passage Boundaries
Another finding is that 28% of analysed samples from the MedQuAD dataset contain boilerplate text unrelated to a specific entity. The boilerplate includes repeated text such as information about how data was collected. In this case our model is able to detect relevant parts of a passage (see Figure 4), but the remaining irrelevant sentences lead to a worse ranking of the passage. Evaluating with flexible passage boundaries would eliminate this issue and be a better match for real-world scenarios, in which the interest of a medical professional is mainly focused on non-boilerplate parts of a document.
We find that most questions in our evaluation can be represented as tuples of entity and aspect without information loss. However, in 4% of analysed queries in the HealthQA dataset we see a mismatch between question and query. For instance, the question “How common is OCD in Children and Young People?” is more specific than the assigned query tuple “Obsessive-compulsive disorder” and “prevalence”. Different solutions are possible for representing more complex queries, e.g. by composing multiple queries during retrieval. We leave these questions for future research.
6. Conclusions and Future Work
We present CDV, a contextualized document representation that uses structured entity/aspect queries to retrieve answer passages from long healthcare documents. Our model is based on a hierarchical dual encoder architecture which combines interaction-based feature encoding with low-latency continuous retrieval. In comparison to previous approaches, CDV is able to integrate document context into its representations, which helps to resolve long-range dependencies normally not visible to passage re-ranking models. We train a self-supervised model on medical Wikipedia texts and show that it ranks best or second best on three healthcare answer retrieval tasks, compared to 14 strong baseline models. We trained all models using the same data and provide structured labels on three existing datasets for the evaluation of this task.
In future work, we will address the transfer of the CDV model to different languages, rare diseases and more complex question types. Another open challenge is the extraction of passage boundaries during retrieval. Furthermore, it will be interesting to see how fine-tuning the model on supervised data improves retrieval performance.
Acknowledgements. Our work is funded by the German Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD19003b (PLASS) and 01MD19013D (Smart-MD).
- A Question-Entailment Approach to Question Answering. BMC Bioinformatics 20 (1), pp. 511. Cited by: §5.1.1.
- Overview of the mediqa 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 370–379. Cited by: §2.
- YAGO-QA: Answering questions by structured knowledge queries. In Fifth International Conference on Semantic Computing, pp. 158–161. Cited by: §3.
- SECTOR: A Neural Model for Coherent Topic Segmentation and Classification. Transactions of the Association for Computational Linguistics 7, pp. 169–184. Cited by: item 3, §1, §2, §3.2, §4.2, §4.3.3, §5.1.1.
- A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399. Cited by: item 3, §2.
- A general and adaptive robust loss function. In , pp. 4331–4339. Cited by: §4.4.2.
- Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. In Pacific Symposium on Biocomputing, Vol. 25, pp. 295–306. Cited by: §3.1.
- Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §3.1.1, §3.2, §4.1.1.
- Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §1, §4.3.1.
- A study of clinical questions posed by hospital clinicians. Journal of the Medical Library Association 92 (4), pp. 445. Cited by: §1, §2, §3.
- A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 615–621. Cited by: §1, §2.
- Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 126–134. Cited by: §1, §2, §5.1.3, Table 3.
- PubCaseFinder: A case-report-based, phenotype-driven differential-diagnosis system for rare diseases. The American Journal of Human Genetics 103 (3), pp. 389–399. Cited by: §1, §1.
- Learning Dense Representations for Entity Retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 528–537. Cited by: §1, §3.1.
- End-to-End Retrieval in Continuous Space. arXiv:1811.08008 [cs.IR]. Cited by: item 4, §1, §2, §5.1.
- Can primary care physicians’ questions be answered using the medical journal literature?. Bulletin of the Medical Library Association 82 (2), pp. 140. Cited by: §1.
- A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55–64. Cited by: §1, §2, §5.1.3, Table 3.
- MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1297–1300. Cited by: §5.1.3.
- Supporting information retrieval from electronic health records: A report of University of Michigan’s nine-year experience in developing and using the Electronic Medical Record Search Engine (EMERSE). Journal of Biomedical Informatics 55, pp. 290–300. Cited by: §1.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.1.1.
- Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pp. 2042–2050. Cited by: §2, §2, §5.1.3, Table 3.
- Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2333–2338. Cited by: §2, §5.1.3, Table 3.
- Evaluation of PICO as a knowledge representation for clinical questions. In AMIA Annual Symposium Proceedings, Vol. 2006, pp. 359. Cited by: item 1, §1, §2.
- Robust estimation of a location parameter. In Breakthroughs in Statistics, pp. 492–518. Cited by: §4.4.2.
- PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 2567–2577. Cited by: §1, §2.
- A Statistical interpretation of term specificity and its application to retrieval. Journal of Documentation 28 (1), pp. 11–21. Cited by: §2, §5.1.3, Table 3.
- Retrieving passages and finding answers. In Proceedings of the 2014 Australasian Document Computing Symposium, pp. 81. Cited by: item 3, §2.
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pp. 1–7. Cited by: §4.1.2, §5.2.
- Zero-Shot Entity Linking by Reading Entity Descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3449–3460. Cited by: item 2, §3.1, §5.1.
- Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pp. 1291–1299. Cited by: §1, §2, §5.1.3, Table 3.
- Benchmark for complex answer retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, pp. 293–296. Cited by: item 1.
- Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (4), pp. 694–707. Cited by: §3.1.1.
- Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, pp. 1410–1418. Cited by: §1, §5.
- Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2793–2799. Cited by: §2, §5.1.3, Table 3.
- GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, pp. 1532–1543. Cited by: §4.1.1, §5.1.3.
- The well-built clinical question: a key to evidence-based decisions. ACP Journal Club 123 (3), pp. A12–3. Cited by: §1, §2.
- Okapi at TREC-3. NIST Special Publication SP 109, pp. 109. Cited by: §2, §5.1.3, §5.1, Table 3.
- Relevance weighting of search terms. Journal of the American Society for Information Science 27 (3), pp. 129–146. Cited by: §2.
- Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (5), pp. 513–523. Cited by: §2.
- Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Medical Informatics and Decision Making 7 (1), pp. 16. Cited by: §1.
- Smart-MD: Neural Paragraph Retrieval of Medical Topics. In The Web Conference 2018 Companion, pp. 203–206. Cited by: §1.
- Bidirectional attention flow for machine comprehension. 5th International Conference on Learning Representations. Cited by: §1, §2.
- Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 279–287. Cited by: §3.1.1.
- Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pp. 373–374. Cited by: §2, §5.1.3, Table 3.
- Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, pp. 41–47. Cited by: §2.
- A deep architecture for semantic matching with multiple positional sentence representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2835–2841. Cited by: §2, §5.1.3, Table 3.
- Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 189–198. Cited by: §1, §2.
- End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55–64. Cited by: §2, §5.1.3, Table 3.
- aNMM: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 287–296. Cited by: §2, §5.1.3, Table 3.
- Beyond factoid QA: effective methods for non-factoid answer sentence retrieval. In European Conference on Information Retrieval, pp. 115–128. Cited by: §1, §2.
- Multilabel neural networks with applications to functional genomics and text categorization. IEEE transactions on Knowledge and Data Engineering 18 (10), pp. 1338–1351. Cited by: §5.2.
- A Hierarchical Attention Retrieval Model for Healthcare Question Answering. In The World Wide Web Conference, pp. 2472–2482. Cited by: §1, §2, §2, §5.1.1, §5.1.3, Table 3.