DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations

01/15/2020 ∙ by Markus Zlabinger, et al. ∙ Johannes Kepler University Linz

The effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis and the discovery of unexpected associations between diseases. While existing disease-symptom relationship extraction methods are used as the foundation of these medical tasks, no collection is available to systematically evaluate the performance of such methods. In this paper, we introduce the Disease-Symptom Relation collection (DSR-collection), created by five fully trained physicians as expert annotators. We provide graded symptom judgments for diseases by differentiating between "symptoms" and "primary symptoms". Further, we provide several strong baselines based on the methods used in previous studies. The first method is based on word embeddings, the second on co-occurrences of keywords in medical articles. For the co-occurrence method, we propose an adaptation in which not only keywords are considered, but also the full text of medical articles. The evaluation on the DSR-collection shows the effectiveness of the proposed adaptation in terms of nDCG, precision, and recall.




1 Introduction

Disease-symptom knowledge bases are the foundation for many medical tasks – including medical diagnosis [10] or the discovery of unexpected associations between diseases [14, 2]. Most knowledge bases only capture a binary relationship between diseases and symptoms, neglecting the degree of importance between a symptom and a disease. For example, abdominal pain and nausea are both symptoms of an appendicitis, but while abdominal pain is a key differentiating factor, nausea does little to distinguish appendicitis from other diseases of the digestive system. While several disease-symptom extraction methods have been proposed that retrieve a ranked list of symptoms for a disease [14, 11, 8, 13], no collection is available to systematically evaluate the performance of such methods [12]. While these methods are extensively used in downstream tasks, e.g., to increase the accuracy of computer-assisted medical diagnosis [10], their effectiveness for disease-symptom extraction remains unclear.

In this paper, we introduce the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. The collection is annotated by five physicians and contains 235 symptoms for 20 diseases. We label the symptoms using graded judgments [6], differentiating between relevant symptoms (graded as 1) and primary symptoms (graded as 2). Primary symptoms, also called cardinal symptoms, are the leading symptoms that guide physicians in the process of disease diagnosis. The graded judgments allow, for the first time, measuring the importance of different symptoms with grade-based metrics, such as nDCG [5].

As baselines, we implement two methods from previous studies to compute graded disease-symptom relations. In the first method [11], the relation is the cosine similarity between the word vectors of a disease and a symptom, taken from a word embedding model. In the second method [14], the relation between a disease and a symptom is calculated based on their co-occurrence in the MeSH-keywords (meta-data that indicate the core topics of a medical article) of medical articles. We describe limitations of the keyword-based method [14] and propose an adaptation in which we calculate the relations not only on the keywords of medical articles, but also on the full text and the title.

We evaluate the baselines on the dsr-collection to compare their effectiveness in the extraction of graded disease-symptom relations. As evaluation metrics, we consider precision, recall, and nDCG. For all three metrics, our proposed adapted version of the keyword-based method outperforms the other methods, providing a strong baseline for future work on the collection.


The contributions of this paper are the following:

  • We introduce the dsr-collection for the evaluation of graded disease-symptom relations. We make the collection freely available to the research community (to gain access, contact this paper’s first author).

  • We compare various baselines on the dsr-collection to give insights on their effectiveness in the extraction of disease-symptom relations.

2 Disease-Symptom Relation Collection

In this section, we describe the new Disease-Symptom Relation Collection (dsr-collection) for the evaluation of disease-symptom relations. We create the collection in two steps: In the first step, relevant disease-symptom pairs (e.g. appendicitis-nausea) are collected by two physicians. They collect the pairs in a collaborative effort from high-quality sources, including medical textbooks and an online information service (the website netdoktor.at, certified by the Health on the Net Foundation) that is curated by medical experts.

In the second step, the primary symptoms of the collected disease-symptom pairs are annotated. The annotation of primary symptoms is conducted to incorporate graded relevance information into the collection. For the annotation procedure, we develop guidelines that briefly describe the task, as well as an online annotation tool. Then, the annotation of primary symptoms is conducted by three physicians. The final label is obtained by majority voting. Based on the labels obtained from the majority voting, we assign the relevance score 2 to primary symptoms and 1 to the other symptoms, which we call relevant symptoms.
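The grade assignment from the three primary-symptom votes can be sketched as follows (a minimal sketch; the label strings and the helper function are hypothetical, the collection itself stores only the final grades):

```python
def assign_grades(votes):
    """votes: dict mapping a symptom to the list of three annotator
    labels, each either "primary" or "relevant" (hypothetical labels).
    A symptom is graded 2 if at least two of the three annotators
    mark it as primary; otherwise it is graded 1 (relevant symptom)."""
    return {s: 2 if v.count("primary") >= 2 else 1 for s, v in votes.items()}


grades = assign_grades({
    "abdominal pain": ["primary", "primary", "relevant"],
    "nausea": ["relevant", "relevant", "primary"],
})
```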

In total, the dsr-collection contains relevant symptoms and primary symptoms for 20 diseases. We give an overview of the collection in Table 1. For the 20 diseases, the collection contains a total of 235 symptoms, of which 55 are labeled as primary symptom (about 25%). The top-3 most occurring symptoms are: fatigue, which appears for 15 of the 20 diseases; fever, which appears for 10; and coughing, which appears for 7. Notice that the diseases are selected from different medical disciplines: mental (e.g. Depression), dental (e.g. Periodontitis), digestive (e.g. Appendicitis), and respiratory (e.g. Asthma).

Disease #S #P κ Disease #S #P κ
Anorexia Nervosa 7 2 1.00 Influenza 11 2 0.57
Appendicitis 7 2 1.00 Measles 9 4 0.38
Asthma 9 4 0.76 Mental Depression 13 3 0.21
Bronchitis 9 1 0.71 Migraine Disorders 12 4 0.37
Cholecystitis 12 1 0.55 Myocardial Infarction 11 4 0.44
COPD 7 3 0.83 Periodontitis 3 4 0.46
Diabetes Mellitus 11 3 0.72 Pulmonary Embolism 13 2 0.83
Epididymitis 8 2 0.67 Sleep Apnea Syndromes 13 2 0.31
Erysipelas 7 3 0.69 Tonsillitis 7 4 0.63
GERD 8 2 0.76 Trigeminal Neuralgia 3 3 0.28
Table 1: Overview of the dsr-collection. For each disease, we display the number of relevant symptoms (#S), the number of primary symptoms (#P), and the Fleiss’ inter-annotator agreement (κ).

We calculate the inter-annotator agreement using Fleiss’ kappa [3], a statistical measure of the agreement between three or more annotators. For the annotation of the primary symptoms, the measured kappa value indicates substantial agreement between the three annotators [7]. Individual κ-values per disease are reported in Table 1. By analyzing the disagreements, we found that the annotators labeled primary symptoms with varying frequencies: the first annotator annotated on average 2.1 primary symptoms per disease, the second 2.8, and the third 3.8.
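For reference, Fleiss’ kappa for a fixed number of raters per item can be computed as follows (a minimal sketch; the input format is an assumption, not the annotation tool’s actual export):

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items rated by the same number of annotators.
    ratings[i] is the list of category labels assigned to item i
    (here, e.g., "primary" vs. "relevant")."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j]: how many raters assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # per-item agreement P_i
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # overall proportion of ratings per category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement across two categories the function returns 1.0; disagreements push the value toward (and below) 0.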

Vocabulary Compatibility:

We map each disease and symptom of the collection to the Unified Medical Language System (UMLS) vocabulary. The UMLS is a compendium of over 100 vocabularies (e.g. ICD-10, MeSH, SNOMED-CT) that are cross-linked with each other. This makes the collection compatible with the UMLS vocabulary and also with each of the over 100 cross-linked vocabularies.

Although the different vocabularies are compatible with the collection, a fair comparison of methods is only possible when the methods utilize the same vocabulary since the vocabulary impacts the evaluation outcome. For instance, the symptom loss of appetite is categorized as a symptom in MeSH; whereas, in the cross-linked UMLS vocabulary, it is categorized as a disease. Therefore, the symptom loss of appetite can be identified when using the MeSH vocabulary, but it cannot be identified when using the UMLS vocabulary.


Evaluation Metrics:

We consider the following evaluation metrics for the collection: Recall@k, Precision@k, and nDCG@k at the cutoffs k=5 and k=10. Recall measures how many of the relevant symptoms are retrieved, Precision measures how many of the retrieved symptoms are relevant, and finally, nDCG is a standard metric to evaluate graded relevance [6].
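The three metrics can be sketched as follows, with `grades` mapping each relevant symptom of a disease to its grade (2 = primary, 1 = relevant; the symptom names below are illustrative):

```python
import math


def ndcg_at_k(ranked, grades, k):
    """ranked: system-ranked symptom list; grades: dict symptom -> grade
    (2 = primary, 1 = relevant; absent = non-relevant)."""
    dcg = sum(grades.get(s, 0) / math.log2(i + 2)
              for i, s in enumerate(ranked[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def precision_at_k(ranked, grades, k):
    # fraction of the top-k retrieved symptoms that are relevant
    return sum(1 for s in ranked[:k] if s in grades) / k


def recall_at_k(ranked, grades, k):
    # fraction of the relevant symptoms found in the top k
    return sum(1 for s in ranked[:k] if s in grades) / len(grades)


grades = {"abdominal pain": 2, "nausea": 1, "fever": 1}
ranked = ["abdominal pain", "obesity", "nausea"]
```

Note that nDCG rewards placing the primary symptom (grade 2) above merely relevant ones, which is exactly what the graded judgments of the collection enable.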

3 Disease-Symptom Extraction Methods

3.1 Related Methods

In this section, we discuss disease-symptom extraction methods used in previous studies. A commonly used resource for the extraction of disease-symptom relations is the PubMed database, which contains more than 30 million biomedical articles, including the abstract, title, and various meta-data. Previous work [8, 4] uses the abstracts of the PubMed articles together with rule-based approaches. In particular, Hassan et al. [4] derive patterns of disease-symptom relations from dependency graphs, followed by the automatic selection of the best patterns based on proposed selection criteria. Martin et al. [8] generate extraction rules automatically, which are then inspected for their viability by medical experts. Xia et al. [13] design special queries that include the name and synonyms of each disease and symptom. They use these queries to retrieve the relevant articles, and use the number of retrieved results to perform a ranking via Pointwise Mutual Information (PMI).

The mentioned studies use resources that are not publicly available, i.e., rules in [8, 4] and special queries in [13]. To enable reproducibility in future studies, we define our baselines based on the methods that only utilize publicly available resources, described in the next section.

3.2 Baseline Methods

Here, we first describe two recently proposed methods [11, 14] for the extraction of disease-symptom relations as our baselines. Afterwards, we describe limitations of the method described in [14] and propose an adapted version in which the limitations are addressed. We apply the methods to the open-access subset of the PubMed Central (PMC) database, containing 1,542,847 medical articles. To have a common representation for diseases/symptoms across methods (including a unique name and identifier), we consider the 382 symptoms and 4,787 diseases from the Medical Subject Headings (MeSH) vocabulary [14]. Given the set of diseases D and the set of symptoms S, each method aims to compute a relation score rel(d, s) between a disease d ∈ D and a symptom s ∈ S. In the following, we explain each method in detail.


Embedding: Proposed by Shah et al. [11], this method is based on the cosine similarity of the vector representations of a disease and a symptom. We first apply MetaMap [1], a tool for the identification of medical concepts within a given text, to the full text of all PMC articles to substitute the identified diseases/symptoms with their unique names. Then, we train a word2vec model [9] with 300 dimensions and a window size of 15, following the parameter setting in [11]. Using the word embedding, the disease-symptom relation is defined as rel(d, s) = cos(v_d, v_s), where v_w refers to the vector representation of a word w.
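A minimal sketch of the embedding-based score, assuming the trained vectors are available as a plain dict from concept name to vector (in practice they would come from the trained word2vec model):

```python
import numpy as np


def rel_embedding(word_vectors, disease, symptom):
    """Cosine similarity between the embedding vectors of a disease
    and a symptom; word_vectors maps concept names to numpy arrays."""
    v_d, v_s = word_vectors[disease], word_vectors[symptom]
    return float(np.dot(v_d, v_s) /
                 (np.linalg.norm(v_d) * np.linalg.norm(v_s)))


vectors = {
    "appendicitis": np.array([1.0, 0.0]),
    "abdominal pain": np.array([1.0, 0.0]),
    "thinness": np.array([0.0, 1.0]),
}
```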


CoOccur: This method, proposed by Zhou et al. [14], calculates the relation of a disease and a symptom by measuring the degree of their co-occurrence in the MeSH-keywords of medical articles. The raw co-occurrence of the disease d and symptom s is denoted by c(d, s). The raw co-occurrence does not consider the overall appearance of each symptom across diseases. For instance, symptoms like pain or obesity tend to co-occur with many diseases, and are therefore less informative. Hence, the raw co-occurrence is normalized by an Inverse Symptom Frequency (ISF) measure, defined as ISF(s) = log(N / N_s), where N is the total number of diseases and N_s is the number of diseases that co-occur with s in at least one of the articles. Finally, the disease-symptom relation is defined as rel(d, s) = c(d, s) · ISF(s). We compute three variants of the CoOccur method:

  • Kwd: The disease-symptom relations are computed using the MeSH-keywords of the 1.5 million PMC articles.

  • KwdLarge: While Kwd uses the 1.5 million PMC articles, Zhou et al. [14] apply the exact same method to the much larger set of articles in the PubMed database. While they did not evaluate the effectiveness of their disease-symptom relation extraction method, they published their relation scores, which we evaluate in this paper.

  • FullText: Applying the CoOccur method only on MeSH-keywords has two disadvantages: first, keywords are not available for all articles (e.g. only 30% of the 1.5 million PMC articles have keywords), and second, usually only the core topics of an article occur as keywords. We address these limitations by proposing an adaptation of the CoOccur method, in which we use the full text, the title, and the keywords of the 1.5 million PMC articles. Specifically, we adapt the computation of the co-occurrence c(d, s) as follows: We first retrieve the set R_d of articles relevant to a disease d, where an article is relevant if the disease occurs in either the keywords or the title section of the article. Given these relevant articles and a symptom s, we compute the adapted co-occurrence c(d, s) as the number of articles in R_d whose full text contains the symptom. The identification of the diseases in the title and symptoms in the full text is done using the MetaMap tool [1].
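Under the assumption that each article has already been reduced to sets of normalized MeSH concept names (the field names below are hypothetical), the FullText variant with ISF normalization can be sketched as:

```python
import math


def co_occurrence_scores(articles, diseases, symptoms):
    """FullText-style CoOccur sketch: an article is relevant to a
    disease d if d appears in its title or keywords; c(d, s) counts
    relevant articles whose full text contains the symptom s.
    Returns rel(d, s) = c(d, s) * ISF(s) with ISF(s) = log(N / N_s)."""
    N = len(diseases)
    c = {}
    for d in diseases:
        relevant = [a for a in articles
                    if d in a["title"] or d in a["keywords"]]
        for s in symptoms:
            c[(d, s)] = sum(1 for a in relevant if s in a["fulltext"])
    scores = {}
    for s in symptoms:
        # number of diseases that co-occur with s in at least one article
        n_s = sum(1 for d in diseases if c[(d, s)] > 0)
        isf = math.log(N / n_s) if n_s else 0.0
        for d in diseases:
            scores[(d, s)] = c[(d, s)] * isf
    return scores


articles = [
    {"title": {"appendicitis"}, "keywords": set(),
     "fulltext": {"nausea", "fatigue"}},
    {"title": {"influenza"}, "keywords": set(),
     "fulltext": {"fatigue"}},
]
scores = co_occurrence_scores(articles, ["appendicitis", "influenza"],
                              ["nausea", "fatigue"])
```

Note how a symptom that co-occurs with every disease (here: fatigue) receives ISF = 0 and thus a zero score, reflecting its low discriminative value.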

4 Evaluation Results & Discussion

We now compare the disease-symptom extraction baselines on the proposed dsr-collection. The results for various evaluation metrics are shown in Table 2. The FullText variant of the CoOccur method outperforms the other baselines on all evaluation metrics. This demonstrates the high effectiveness of our proposed adaptation of the CoOccur method.

Further, we see a clear advantage of the CoOccur method with MeSH-keywords from the much larger PubMed database as the resource (KwdLarge) – in comparison to the same method with keywords from approximately 1.5 million PMC articles (Kwd). This highlights the importance of the number of input samples to the method.

Method nDCG@5 P@5 R@5 nDCG@10 P@10 R@10
Embedding 0.20 0.18 0.08 0.19 0.15 0.13
CoOccur-Kwd 0.27 0.22 0.10 0.22 0.14 0.12
CoOccur-KwdLarge 0.32 0.27 0.12 0.28 0.19 0.17
CoOccur-FullText 0.41 0.39 0.17 0.36 0.28 0.25
Table 2: Comparison of the disease-symptom extraction methods using our proposed dsr-collection. Improvements are tested for statistical significance with a two-sided, paired t-test against Embedding, Kwd, and KwdLarge, respectively.


Error Analysis:

A common error source results from the fine granularity of the symptoms in the medical vocabularies. For example, the utilized MeSH vocabulary contains the symptoms abdominal pain and abdomen, acute (the MeSH term for acute abdominal pain). Both symptoms can be found in the top ranks of the evaluated methods for the disease appendicitis (see Table 3). However, since the collection is not labeled on such a fine-grained level, the symptom abdomen, acute is counted as a false positive.

Embedding Kwd KwdLarge FullText
Abdomen, Acute Abdominal Pain Abdomen, Acute Abdomen, Acute
Abdominal Pain Abdomen, Acute Abdominal Pain Abdominal Pain
Fever of Unknown Origin Obesity Pelvic Pain Vomiting
Renal Colic Thinness Pain, Postoperative Nausea
Table 3: Top-4 extracted symptoms of each method for the disease appendicitis. The retrieved relevant symptoms and primary symptoms are highlighted.

Another error source results from the bias in medical articles towards specific disease-symptom relationships. For instance, between the symptom obesity and periodontitis (a dental disease in which the gum surrounding the teeth recedes), a special relationship exists, which is the topic of various publications. Despite obesity not being a characteristic symptom of periodontitis, all methods return the symptom in the top-3 ranks. A promising research direction is the selective extraction of symptoms from biomedical literature by also considering the context (e.g. the sentence) in which a disease/symptom appears.

5 Conclusion

We introduced the Disease-Symptom Relation Collection (dsr-collection) for the evaluation of graded disease-symptom relations. We provided baseline results for two recent methods, one based on word embeddings and the second on the co-occurrence of MeSH-keywords of medical articles. We proposed an adaptation of the co-occurrence method to make it applicable to the full text of medical articles and showed a significant improvement in effectiveness over the other methods.


  • [1] A. R. Aronson (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the American Medical Informatics Association Symposium, pp. 17–21. External Links: ISSN 1531-605X, 11825149 Cited by: 3rd item, §3.2.
  • [2] E. P. G. del Valle, G. L. García, L. P. Santamaría, M. Zanin, E. M. Ruiz, and A. R. González (2018-06) Evaluating Wikipedia as a Source of Information for Disease Understanding. In 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 399–404. Cited by: §1.
  • [3] J. L. Fleiss (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5), pp. 378. Cited by: §2.
  • [4] M. Hassan, O. Makkaoui, A. Coulet, and Y. Toussain (2015) Extracting Disease-Symptom Relationships by Learning Syntactic Patterns from Dependency Graphs. In Proceedings of BioNLP 15, Beijing, China, pp. 71–80 (en). Cited by: §3.1, §3.1.
  • [5] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446. Cited by: §1.
  • [6] J. Kekäläinen (2005) Binary and graded relevance in IR evaluations – comparison of the effects on ranking of IR systems. Information Processing & Management 41 (5), pp. 1019–1033. Cited by: §1, §2.
  • [7] J. R. Landis and G. G. Koch (1977) The measurement of observer agreement for categorical data. Biometrics, pp. 159–174. Cited by: §2.
  • [8] L. Martin, D. Battistelli, and T. Charnois (2014) Symptom extraction issue. In Proceedings of BioNLP 2014, Baltimore, Maryland, pp. 107–111 (en). Cited by: §1, §3.1, §3.1.
  • [9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §3.2.
  • [10] J. Ni, H. Fei, W. Fan, and X. Zhang (2017-11) Automated Medical Diagnosis by Ranking Clusters Across the Symptom-Disease Network. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 1009–1014. Cited by: §1.
  • [11] S. Shah, X. Luo, S. Kanakasabai, R. Tuason, and G. Klopper (2019-12) Neural networks for mining the associations between diseases and symptoms in clinical notes. Health Information Science and Systems 7 (1) (en). External Links: ISSN 2047-2501 Cited by: §1, §1, §3.2, §3.2.
  • [12] Y. Shen, Y. Li, H. Zheng, B. Tang, and M. Yang (2019-12) Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier. BMC Bioinformatics 20 (1), pp. 330 (en). External Links: ISSN 1471-2105 Cited by: §1.
  • [13] E. Xia, W. Sun, J. Mei, E. Xu, K. Wang, and Y. Qin (2018) Mining Disease-Symptom Relation from Massive Biomedical Literature and Its Application in Severe Disease Diagnosis. In 45 - AMIA 2018 Annual Symposium, pp. 1118–1126 (en). Cited by: §1, §3.1, §3.1.
  • [14] X. Zhou, J. Menche, A. Barabási, and A. Sharma (2014-12) Human symptoms–disease network. Nature Communications 5 (1) (en). External Links: ISSN 2041-1723 Cited by: §1, §1, 2nd item, §3.2, §3.2.