Active research in the medical and biomedical domains has generated a vast number of documents and articles. MEDLINE, the largest biomedical text database, has more than 16 million articles. It is estimated that more than 10,000 articles are added to MEDLINE weekly (Yoo and Song, 2006). There is a continuing need for techniques to discover, search, access and share knowledge from these documents and articles. Text clustering techniques make it possible to group similar text documents in an unsupervised manner.
Most of the research related to biomedical document clustering focuses on either reforming the representation of biomedical documents or improving the clustering algorithms. Biomedical document clustering differs from the general text document clustering task because, in the latter, semantic similarities between words or phrases are not usually considered. One medical concept of disease might be represented in different forms, and some medical concepts of diseases might be highly correlated. For example, ‘Type 2 Diabetes’ is the same concept of disease as ‘Diabetes Mellitus Type 2’, and ‘Hypertension’ might often co-occur with ‘Stroke’. To capture the semantic similarities between words or phrases, previous research on reforming document representation (Logeswari and Premalatha, 2013) (Yoo and Hu, 2006) (Zhang et al., 2007) often uses an existing ontology such as MeSH or WordNet to identify the semantic relationships. However, an ontology does not reflect the co-occurrences of medical concepts. This paper focuses on biomedical document clustering based on the concepts of diseases. The proposed similarity measure between the concepts of diseases is based on the Word2Vec model (Mikolov et al., 2013). This similarity measure identifies the closest concepts based on the co-occurrences of the concepts. The proposed concept weighting scheme is a linear combination of the TF-IDF value, which reflects the content similarity between documents, and a similarity score based on the proposed similarity measure, which reflects the semantic similarity between documents.
The unsupervised learning algorithm Self-Organizing Map (SOM) (Kohonen, 1998) has been used as the clustering technique. SOM has properties of both vector quantization and vector projection. The neurons of an SOM can be presented in a two-dimensional space. By projecting the input data instances onto their best matching units (BMUs) on the SOM map, the distribution of the inputs can be visualized in the two-dimensional space with the U-matrix and hit histogram of the SOM. The relationships of the clusters based on the concepts of diseases can be visualized. This clustering visualization is a beneficial feature for biomedical literature search and browsing based on concepts of diseases.
The rest of the paper is organized as follows. In Section 2, related work is described. Section 3 demonstrates how the concepts of diseases are extracted by using UMLS MetaMap. Sections 4 and 5 detail the concept similarity measure and the weighting scheme for each concept in the document representation. Section 6 describes the SOM clustering algorithm. Experimental settings and results are given in Section 7. Section 8 concludes this research and discusses potential future work.
2. Related Work
A lot of research has been done on biomedical document clustering in past decades. Some of it focused on reforming document representation based on a medical ontology or on using weighting schemes other than TF-IDF, while other work focused on investigating various clustering algorithms.
Zhang et al. (Zhang et al., 2007) reviewed three different ontology-based term similarity measures: path based (Wu and Palmer, 1994), information content based (Resnik et al., 1999), and feature based (Knappe et al., 2007), and then proposed their own similarity measure and term re-weighting scheme. The K-means algorithm was used for document clustering. In the comparison of results, some of the ontology-based schemes performed slightly worse than the word-based scheme. The authors mentioned that this might be due to limitations of the domain ontology, term extraction and sense disambiguation. Visualization of the relationships between the clusters was not included in this research.
Yoo et al. (Yoo and Song, 2006) used a graphical representation method to represent a set of documents based on the MeSH ontology, and proposed document clustering and summarization with this graphical representation. The document clustering and summarization model achieved comparable clustering results and also provided some visualization of the document cluster model based on the relationships of the terms. However, this visualization relies largely on the MeSH ontology instead of the document relationships themselves.
Logeswari et al. (Logeswari and Premalatha, 2013) proposed a concept weighting scheme based on the MeSH ontology and tri-gram extraction to extract concepts from the text corpus. The semantic relationships between tri-grams are weighted through a heuristic weight assignment of four predefined semantic relationships. The K-means clustering results show that the concept-based representation was better than the word-based representation. Visualization of the clustering results was not investigated.
Gu et al. (Gu et al., 2013) proposed a concept similarity measure that uses a linear combination of multiple similarity measures based on the MeSH ontology and local content, which includes TF-IDF weighting and coefficient calculation between related article sets. A semi-supervised clustering algorithm was employed at the document clustering stage. Their focus was not clustering visualization.
Some research has addressed visualization to support biomedical literature search. Görg et al. (Görg et al., 2010) developed a visual analytics system named Bio-Jigsaw using the MeSH ontology. This research demonstrated how visual analytics can be used to analyze a search query on a gene related to breast cancer. Neither document representation nor document clustering was discussed.
To the best of the authors’ knowledge, this research is the first to represent concepts of diseases using vectors based on the Word2Vec model instead of using an ontology. It is also the first to apply the proposed similarity measure and concept weighting scheme to biomedical document clustering. SOM-based clustering is employed to visualize the distribution of document clusters based on the concepts of diseases.
3. Concepts of Diseases Extraction
In this work, the focus is on clustering biomedical documents based on the concepts of diseases that are addressed by or mentioned in the documents. To extract the concepts of diseases from the documents, Unified Medical Language System (UMLS) MetaMap is used. UMLS MetaMap (Met, [n. d.]b) is a natural language processing tool that makes use of various sources such as the UMLS Metathesaurus (Met, [n. d.]a) and SNOMED CT (SCT, [n. d.]) to map the phrases or terms in the text to different semantic types.
Figure 1 provides an example of mapping phrases to different semantic types using UMLS MetaMap. In this example, eight terms or phrases in the sentence have been mapped to six semantic types. The phrase ‘Haemophilus influenzae type b meningitis’ in the sentence has been identified as semantic type ‘disease or syndrome’ and mapped to the phrase ‘Type B Hemophilus influenzae Meningitis’ based on the lexicon that UMLS MetaMap uses. In this research, if a term or phrase has been mapped to the semantic type ‘Disease or Syndrome’ or ‘Neoplastic Process’, the corresponding phrase in the lexicon produced by MetaMap is extracted.
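The extraction rule described above can be sketched in Python as follows. This is a minimal illustration, not MetaMap's actual API: the input format (a list of dictionaries with `phrase`, `preferred_name` and `semantic_type` keys) is a hypothetical stand-in for already-parsed MetaMap output, and the second example entry is made up. Only the two retained semantic types come from the text.

```python
# Semantic types retained in this work (taken from the text above).
DISEASE_TYPES = {"disease or syndrome", "neoplastic process"}


def extract_disease_concepts(mappings):
    """Return the lexicon phrases of mappings whose semantic type is a disease type.

    `mappings` is a hypothetical, already-parsed representation of MetaMap
    output: one dict per mapped phrase.
    """
    concepts = []
    for m in mappings:
        if m["semantic_type"].lower() in DISEASE_TYPES:
            # Keep the normalized lexicon phrase, not the surface text.
            concepts.append(m["preferred_name"])
    return concepts


# Illustrative input only; the second entry is invented for the example.
example = [
    {"phrase": "Haemophilus influenzae type b meningitis",
     "preferred_name": "Type B Hemophilus influenzae Meningitis",
     "semantic_type": "Disease or Syndrome"},
    {"phrase": "children",
     "preferred_name": "Child",
     "semantic_type": "Age Group"},
]
extracted = extract_disease_concepts(example)
print(extracted)
```

Keeping the lexicon phrase rather than the surface text means that spelling variants of the same phrase collapse to one concept before any similarity is computed.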
4. Concepts Similarity Measure
In the biomedical literature, the same disease concept can be expressed by different terms or combinations of words. For example, ‘cancer of breast’ and ‘breast cancer’ are two phrases that express the same disease concept. However, they are treated as two different concepts if the typical vector space model and the TF-IDF weighting scheme are used for document representation, and the semantic similarity between them is not measured. In this research, a semantic similarity measure between different concepts of diseases is proposed. Given a total of $n$ concepts of diseases extracted from the raw text corpus, the similarities between any two concepts are stored in the similarity matrix $S$ presented in Equation 1. Each entry $s_{ij}$ in the matrix represents the similarity between concepts $c_i$ and $c_j$.
$$S = \begin{pmatrix} s_{11} & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nn} \end{pmatrix} \tag{1}$$
To calculate the similarity between two concepts, each word is first represented by a vector (Equation 2). This vector representation is learned by training the Word2Vec model. The Word2Vec training algorithm was developed by a team of researchers at Google led by Tomas Mikolov (Mikolov et al., 2013). It is a computationally efficient algorithm that generates vectors of real numbers to represent the words in a given raw text corpus. These vector representations are learned by three-layer neural networks using either a continuous bag-of-words approach or a skip-gram architecture. The vectors preserve the distances between words in the vector space, so that words that share common contexts in the raw text corpus are located in close proximity to one another. The dimension of the vectors depends on the number of neurons in the hidden layer of the neural network used to train the Word2Vec model.
$$\mathbf{w} = (w_1, w_2, \ldots, w_d) \tag{2}$$
$d$: the dimension of the vector.
In this research, a trained Word2Vec model (Moen and Ananiadou, 2013) created from a subset of the PubMed literature database and a subset of the PubMed Central (PMC) Open Access database is employed. These two text corpora contain a large number of biomedical documents. The trained model creates 200-dimensional vectors to represent the words extracted from the two corpora. The skip-gram architecture with a window size of 5 is adopted for the learning process (Moen and Ananiadou, 2013).
Although some concepts of diseases contain only one word, many of them span multiple words. In this work, if a concept of disease spans multiple words, a concept vector is generated by aggregating the vectors of all the words in the concept, as shown in Equation 3. For example, for the disease ‘diabetes mellitus’, the vector for ‘diabetes’ and the vector for ‘mellitus’ are aggregated by adding them together.
$$\mathbf{v}_c = \sum_{k=1}^{m} \mathbf{w}_k \tag{3}$$
$m$: the total number of words in a concept $c$; $\mathbf{w}_k$: the vector of the $k$-th word of the concept.
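The similarity score between two concepts is calculated as the cosine similarity between their vectors, as shown in Equation 4.
$$s_{ij} = \frac{\mathbf{v}_{c_i} \cdot \mathbf{v}_{c_j}}{\lVert \mathbf{v}_{c_i} \rVert \, \lVert \mathbf{v}_{c_j} \rVert} \tag{4}$$
Here $\mathbf{v}_{c_i}$ and $\mathbf{v}_{c_j}$ are the vectors of concepts $c_i$ and $c_j$.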
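As a minimal sketch of the two steps just described (summing word vectors into a concept vector, then comparing concepts by cosine), the following Python uses only the standard library. The three-dimensional toy vectors are made up for illustration; the model used in this research produces 200-dimensional vectors. Handling of words missing from the vocabulary is not covered by the text and is left out here.

```python
import math


def concept_vector(concept, word_vectors):
    """Sum the word vectors of every word in a (possibly multi-word) concept."""
    words = concept.split()
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(word_vectors[word]):
            total[k] += value
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example only.
toy_vectors = {
    "diabetes": [1.0, 0.0, 1.0],
    "mellitus": [0.0, 1.0, 1.0],
    "stroke": [1.0, 0.0, 0.0],
}

v_dm = concept_vector("diabetes mellitus", toy_vectors)  # element-wise sum
v_d = concept_vector("diabetes", toy_vectors)
score = cosine_similarity(v_dm, v_d)
print(v_dm, round(score, 3))
```

Because cosine similarity ignores vector length, summing the word vectors and averaging them would give the same scores.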
By representing concepts as vectors and using this similarity measure, it is observed that the more closely diseases are associated, the higher the similarity scores between them are. Table 1 provides some examples of concepts of diseases and their top 3 closest concepts based on the similarity scores. ‘Hypertension’ is often associated with ‘hyperlipidaemia’ in the literature, so the similarity between them is higher than that between ‘Hypertension’ and other concepts of diseases.
|Concept||Closest concept||Similarity score|
|carpal tunnel syndrome||bilateral carpal tunnel syndrome||0.970|
|carpal tunnel syndrome||cts carpal tunnel syndrome||0.957|
|diabetes||diabetes mellitus type ii||0.868|
|diabetes||dm diabetes mellitus||0.845|
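A ranking of closest concepts such as the one in Table 1 can be obtained from a pairwise similarity matrix. The following Python sketch builds such a matrix from concept vectors and returns the top-k neighbors of a concept. The concept names and two-dimensional vectors are placeholders chosen for illustration, not values from this research.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(vectors):
    """Pairwise cosine similarities between concept vectors (symmetric)."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def top_k_closest(index, sim, k=3):
    """Indices of the k concepts most similar to concept `index`, excluding itself."""
    others = [j for j in range(len(sim)) if j != index]
    others.sort(key=lambda j: sim[index][j], reverse=True)
    return others[:k]


concepts = ["a", "b", "c", "d"]  # placeholder concept names
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
sim = similarity_matrix(vectors)
closest = [concepts[j] for j in top_k_closest(0, sim, k=2)]
print(closest)
```

For a large number of concepts the full matrix grows quadratically, so in practice only the top-k neighbors per concept might be stored.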
5. Document Representation and Weighting Scheme
In this research, the typical vector space model is used to represent a biomedical document, with each entry of the vector corresponding to a concept of disease identified through UMLS MetaMap. The proposed weight given to each concept is calculated as in Equation 5, which uses the following quantities:
- the number of documents in which the concept occurs at least once;
- the frequency of the concept in the document;
- the total number of documents in the corpus;
- the similarity between the concept and another concept, where both occur in the document and the other concept is a frequent concept in the document;
- the total number of concepts in the document;
- the top closest concepts of the concept; in this research, the top 3 closest concepts are used.
If a concept occurs in a document, the weighting scheme uses the TF-IDF value to reflect the occurrence of the concept in the local content. The similarity term is the sum of the similarity scores between this concept and the other concepts that also occur within the document. If a concept does not occur in the document, the weight is calculated as a weighted sum over the top 3 closest concepts that appear in the document, based on their similarity scores. With this weighting scheme, the representation accounts for occurrences of different representations of the same or similar concepts. For example, suppose ‘diabetes’ occurs in one document but ‘diabetes mellitus’ occurs in another. With the traditional TF-IDF weighting scheme, the value of each concept would be 0 for the document in which it does not appear. With the proposed weighting scheme, however, the concepts are weighted based on the similarity between the concept and its closest concepts. Thus, for the document that does not contain the concept ‘diabetes mellitus’, the similarity score between ‘diabetes mellitus’ and the other concepts that appear in the document is used instead of 0.
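The two cases of the weighting scheme can be illustrated with the following Python sketch. It is one possible reading of the description, not a reproduction of Equation 5: the exact coefficients of the linear combination and the weights of the fallback sum are assumptions here (both terms are simply added for a present concept, and the fallback is the mean of the top-k similarity scores). The function names and the toy similarity function are hypothetical.

```python
import math


def tf_idf(tf, df, n_docs):
    """Classic TF-IDF: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def concept_weight(concept, doc_counts, df, n_docs, sim, k=3):
    """Weight of `concept` in one document (illustrative reading of the scheme).

    doc_counts: {concept: frequency in this document}
    df:         {concept: number of documents containing it}
    sim:        function (concept_a, concept_b) -> similarity score
    """
    others = [c for c in doc_counts if c != concept]
    if concept in doc_counts:
        # Present: TF-IDF plus the summed similarity to co-occurring concepts
        # (assumed unit coefficients for both terms).
        semantic = sum(sim(concept, c) for c in others)
        return tf_idf(doc_counts[concept], df[concept], n_docs) + semantic
    # Absent: fall back on the k closest concepts that do occur in the document
    # (assumed equal weights, i.e. the mean of their similarity scores).
    closest = sorted((sim(concept, c) for c in others), reverse=True)[:k]
    return sum(closest) / len(closest) if closest else 0.0


def toy_sim(a, b):
    # Made-up similarity scores for illustration.
    return 0.8 if {a, b} == {"diabetes", "diabetes mellitus"} else 0.1


doc = {"diabetes": 2}
df = {"diabetes": 10, "diabetes mellitus": 5}
w_present = concept_weight("diabetes", doc, df, 100, toy_sim)
w_absent = concept_weight("diabetes mellitus", doc, df, 100, toy_sim)
print(w_present, w_absent)
```

In the toy document, ‘diabetes mellitus’ does not occur, yet it receives a non-zero weight through its similarity to ‘diabetes’, which is the behavior the paragraph above describes.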
6. Clustering Algorithm
Self-Organizing Map (SOM) is used for document clustering visualization (Kohonen et al., 2000). SOM implements the topologically ordered display of the data to facilitate understanding the structures of the input data set. It is also readily explainable and easy to visualize. The visualization of the multidimensional data is one of the main application areas of SOM (Kohonen, 1998). These features make SOM an appropriate choice as a clustering algorithm for this paper.
A basic SOM consists of $K$ neurons located on a low-dimensional grid (usually one- or two-dimensional) (Kohonen, 1998). The algorithm responsible for the formation of the SOM involves three basic steps after initialization: sampling, similarity matching, and updating. These three steps are repeated until the formation of the feature map is complete. Each neuron $i$ has a $p$-dimensional prototype weight vector $m_i$. Given a $p$-dimensional sample (input vector) $x$, the algorithm is summarized as follows:
Step 1 (initialization). Choose random values to initialize all the neuron weight vectors $m_i(0)$, $i = 1, \ldots, K$, where $K$ is the total number of neurons in the map.
Step 2 (sampling). Draw a sample $x$ from the input space with a uniform probability.
Step 3 (similarity matching). Find the best matching unit (BMU), or winner neuron, of $x$, denoted here by $b$, which is the neuron (map unit) closest to $x$ in terms of minimum Euclidean distance at time step $t$ (the $t$-th training iteration):
$$b = \arg\min_{i} \lVert x(t) - m_i(t) \rVert, \quad i = 1, \ldots, K \tag{6}$$
Step 4 (updating). Adjust the weight vectors of all neurons using update formula 7, so that the BMU and its topological neighbors are moved closer to the input vector in the input space:
$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x(t) - m_i(t)] \tag{7}$$
where $\alpha(t)$ denotes the learning rate and $h_{bi}(t)$ is a suitable neighborhood kernel function centered on the winner neuron. The kernel function can be, for example, Gaussian:
$$h_{bi}(t) = \exp\left(-\frac{\lVert r_b - r_i \rVert^2}{2\sigma^2(t)}\right) \tag{8}$$
where $r_b$ and $r_i$ denote the positions of neurons $b$ and $i$ on the SOM grid and $\sigma(t)$ is the width of the kernel, or neighborhood radius, at step $t$. Like the learning rate, $\sigma(t)$ decreases monotonically over the training steps. The initial value of the neighborhood radius should be fairly wide so that the ordering direction of the neurons does not change discontinuously; it can be set equal to or greater than half the diameter of the map. Formula 9 gives the initial value of the neighborhood radius for a given map size.
Step 5 (continuation). Continue with sampling until no noticeable changes in the feature map are observed or the pre-defined maximum number of iterations is reached.
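The training loop summarized above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used in this research: it uses a rectangular grid, a linearly decaying learning rate and neighborhood radius, an initial radius of half the longer map side, and a fixed iteration budget as the stopping rule. The parameter names (`lr0`, `seed`) and the toy data are hypothetical.

```python
import math
import random


def find_bmu(weights, x):
    """Grid position of the neuron closest to x in Euclidean distance."""
    return min(weights,
               key=lambda pos: sum((a - b) ** 2 for a, b in zip(weights[pos], x)))


def train_som(data, rows, cols, n_iter, lr0=0.5, seed=0):
    """Train a rectangular-grid SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: a random weight vector for every neuron.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0  # assumed initial radius: half the map side
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed linear decay, floored
        x = rng.choice(data)                     # sampling (uniform)
        bmu = find_bmu(weights, x)               # similarity matching
        for pos, w in weights.items():           # updating
            grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian kernel
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return weights


# Toy data with two well-separated groups, invented for the example.
data = [[0.1, 0.1], [0.15, 0.1], [0.1, 0.15],
        [0.9, 0.9], [0.85, 0.9], [0.9, 0.85]]
som = train_som(data, rows=4, cols=4, n_iter=2000)
print(find_bmu(som, [0.1, 0.1]), find_bmu(som, [0.9, 0.9]))
```

Each update is a convex combination of the old weight and the sample, so the weights always stay inside the range spanned by the initial values and the data.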
The most commonly used visualization techniques of SOM are the U-matrix and the hit histogram.
The U-matrix (Kohonen, 1998) holds all distances between neurons and their immediate neighbor neurons. Figure 2 shows the U-matrix of a map trained on an input data set that has two clusters. The lighter the color of the hexagon connecting any two neurons, the smaller the distance between them. In the U-matrix, two large light regions can be seen, one towards the left and the other towards the right. These regions represent the two clusters obtained by training on the input data set. The U-matrix gives a direct visualization of the number of clusters and their distribution.
The hit histogram of the input data set on the trained map provides a visualization that details the distribution of input data across the clusters. Each input data instance in the data set can be projected onto the closest neuron on a trained SOM map. The closest neuron is called the best matching unit (BMU) of the input data instance. The hit histogram is constructed by counting the number of hits each neuron receives from the input data set. Figure 3 shows the hit histogram of an input data set on the trained SOM map. Each hexagon represents one neuron on the map. The size of the marker indicates the number of hits the neuron receives. Thus, a larger marker represents a larger number of hits on that neuron. The hit histogram shows that most of the input data hit neurons in the left and right regions. These two regions correspond to the two clusters on the U-matrix shown in Figure 2.
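Both visualizations reduce to simple computations on the trained weight vectors, sketched below in plain Python. Two simplifications are assumed: the grid is rectangular with a 4-neighborhood rather than hexagonal, and the U-matrix stores one value per neuron (the mean distance to its neighbors) instead of every pairwise neighbor distance. The hand-made 1-by-3 map and the data are illustrative values only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hit_histogram(weights, data):
    """Count, for every neuron, how many inputs have it as best matching unit."""
    hits = {pos: 0 for pos in weights}
    for x in data:
        bmu = min(weights, key=lambda pos: euclidean(weights[pos], x))
        hits[bmu] += 1
    return hits


def u_matrix(weights):
    """Mean distance from each neuron to its grid neighbors (4-neighborhood)."""
    u = {}
    for (r, c), w in weights.items():
        neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        dists = [euclidean(w, weights[p]) for p in neighbors if p in weights]
        u[(r, c)] = sum(dists) / len(dists) if dists else 0.0
    return u


# Hand-made 1x3 map: two neurons near 0 and one near 1.
weights = {(0, 0): [0.0], (0, 1): [0.1], (0, 2): [1.0]}
data = [[0.0], [0.02], [0.12], [0.9], [1.0]]
hits = hit_histogram(weights, data)
umat = u_matrix(weights)
print(hits)
print(umat)
```

In this toy map the large U-matrix value at the right-hand neuron marks the boundary between the two groups, which is the role the darker cells play in the figures.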
7. Experiment Setting and Result Analysis
To evaluate the proposed biomedical document clustering framework that is based on the concepts of diseases, subsets of two large biomedical document collections have been used: PubMed Central Open Access and the Ohsumed collection. These two document collections and the corresponding clustering results and visualizations are described in the following subsections.
The PubMed Central Open Access data set has been used by many research projects to examine tasks of biomedical literature clustering and classification (Zhang et al., 2007) (Zhu et al., 2009). It is also a part of the training corpus for the Word2Vec model used in this research. The Ohsumed document collection is a data set that has been used by many researchers (Bloehdorn et al., 2006) (Simeon and Hilderman, 2008) for text mining. Although the Ohsumed collection includes documents that are not up to date, it is used to evaluate the robustness of the proposed document clustering framework on a data set where concepts might be presented differently from those in the training corpus for Word2Vec.
7.1.1. PubMed Central Open Access (PMC-OA)
The PubMed Central Open Access subset (PMC, [n. d.]) contains over 1 million articles from the total collection of articles in PMC. For this research, a set of 600 articles was randomly selected from the ‘A-B’ subset, which includes articles from journals whose names start with the letter ‘A’ or ‘B’. The number of selected articles from each journal is shown in Table 2.
|Name of journal||# of documents|
|American Journal of Hypertension||13|
|Augmentative and alternative communication||2|
|Ancient Science of Life||3|
|Bioinformatics and biology insights||45|
|Allergy and asthma proceedings||28|
|Anesthesia, essays and researches||135|
|Biological trace element research||31|
|Bone Marrow Research||1|
|Brain and language||1|
|American journal of physiology. Endocrinology and metabolism||11|
|Annals of rehabilitation medicine||323|
To be consistent with the Ohsumed collection, only the content of the ‘Title’ and ‘Abstract’ sections of these documents is used. UMLS MetaMap identifies 658 unique concepts of diseases. Figure 4 shows the distribution of these concepts based on the number of words in each concept.
7.1.2. Ohsumed Collection
The Ohsumed collection (ohs, [n. d.]) used here includes the abstracts of 20,000 articles. These articles are related to cardiovascular diseases and are further categorized into 23 cardiovascular disease categories. For this research, a subset of 600 documents is randomly selected. These documents cover all the 23 categories. Table 3 shows the number of documents selected from each category.
|Category||Label||# of documents|
|Bacterial Infections and Mycoses||C01||22|
|Digestive System Diseases||C06||23|
|Respiratory Tract Diseases||C08||24|
|Nervous System Diseases||C10||23|
|Urologic and Male Genital Diseases||C12||26|
|Female Genital Diseases and||C13||27|
|Hemic and Lymphatic Diseases||C15||28|
|Neonatal Diseases and Abnormalities||C16||25|
|Skin and Connective Tissue Diseases||C17||28|
|Nutritional and Metabolic Diseases||C18||27|
|Disorders of Environmental Origin||C21||26|
|Pathological Conditions, Signs||C23||29|
After the concepts of diseases are identified using UMLS MetaMap, 67 documents have no disease-related concept identified. These documents are not included in the experiments. In total, 1449 concepts of diseases are identified and extracted from the remaining 533 documents. Figure 4 shows the distribution of these concepts based on the number of words in each concept, in comparison with the concepts extracted from PMC-OA. Although the total number of concepts of diseases extracted from the Ohsumed collection is higher than the number extracted from PMC-OA, the distribution based on the number of words in the concepts is very similar. The Ohsumed collection has a slightly higher ratio of one-word concepts, whereas the PMC-OA collection has a higher ratio of concepts spanning two words. The percentages of concepts with 3 or more words are almost the same.
7.2. Clustering, Visualization and Discussion
SOM has been used for document clustering after concept extraction and document representation using the proposed weighting scheme. The size of the map is 10 by 10, giving 100 neurons. The number of training iterations is set to 50,000.
Figure 5 shows the clustering results for the PMC-OA document collection. Among the 658 concepts extracted using UMLS MetaMap, 180 concepts occur in more than one document. Compared to the Ohsumed collection, a larger proportion of concepts have a document frequency greater than 1 (180 of 658, compared with 341 of 1449). This means that the TF-IDF value in the weighting scheme has more impact on the PMC-OA data set than it does on the Ohsumed collection. The U-matrix of the trained SOM map shows clearer boundaries than that of the Ohsumed collection.
From the U-matrix and hit histogram, 8 clusters can be clearly identified based on the darker colored neurons surrounding them. Cluster 5 is a large cluster of documents containing concepts of diseases such as ‘obesity’, ‘diabetes’, ‘hypertension’ and ‘hyperglycemia’. This cluster also includes other diseases such as ‘coronary artery disease’, since these diseases are highly related: ‘coronary artery disease’ might be an outcome of ‘hypertension’, ‘hyperglycemia’ or their combination. Cluster 6 contains documents discussing infection-related diseases such as ‘Malaria’; respiratory infections such as ‘tuberculosis’ and ‘bronchitis’ are also included in this cluster. Cluster 8 is a smaller cluster which mainly includes documents related to chest infections. Cluster 7 is another smaller cluster about concepts of infections such as avian influenza. Clusters 6, 7 and 8 are close to each other since they are all about infections, but each focuses on a smaller area of infections. Cluster 4 contains three neurons which represent three different types of concepts. The left region is dominated by documents which discuss ‘dysphagia’ and similar concepts such as ‘laryngospasm’. The concepts in the right half of the cluster include spinal disorders like ‘stenosis’, ‘scoliosis’ and ‘spinal instability’. The top of the cluster is dominated by pain-related diseases and syndromes. Cluster 3 contains documents that are related to ‘paralysis’ and nerve damage. Many of them discuss paralysis of the face, hands (‘carpal tunnel syndrome’), spine (‘spinal cord atrophy’), brain (‘cerebral palsy’), legs (‘spastic foot’), and so on. All these concepts are related and close to each other, and thus the cluster is well formed. Cluster 2 is a cluster of documents with concepts about different cardiovascular diseases such as ‘hypertension’, ‘myocardial infarction’, ‘coronary artery disease’, ‘coronary heart disease’, and ‘ischemic strokes’. This cluster also contains a few documents about brain strokes that arise from lesions in the brain or that lead to speech disorders. More than half of the documents in this cluster discuss strokes and closely related cardiac concepts. One interesting finding is that cluster 1 contains all the documents in which the only concept identified by UMLS MetaMap is ‘stroke’. However, further analysis shows that these documents have nothing to do with ‘stroke’ as a disease. This shows that UMLS MetaMap cannot always accurately map all concepts to the semantic types.
Figure 6 shows the clustering results for the Ohsumed collection based on the extracted concepts of diseases. Based on the document frequencies of the concepts in the Ohsumed collection, 1108 of the 1449 concepts occur in only one document, and a total of 1414 concepts occur in fewer than five documents. This means that the weights of these concepts rely heavily on the similarity measure between the concepts.
Based on the original data set description, all documents are related to cardiovascular diseases. This leads to shorter distances between neurons, which is reflected in the colors of the U-matrix. By analyzing the U-matrix and hit histogram of the trained map, 8 clusters are identified. A majority of the documents in cluster 7 are about infections and infectious diseases, with half of them from the categories of bacterial infections and mycoses (C01), virus diseases (C02) and parasitic diseases (C03). The rest of the documents in this cluster discuss other infections from categories like respiratory tract diseases (C08) and digestive system diseases (C06). There are also a few documents from immunologic diseases (C19) in this cluster. Notably, all of the documents discuss infections of different types. Cluster 1 has documents that discuss diseases of the nervous system, whereas the documents in cluster 2 discuss neoplasms, which include different types of cancers of the brain, prostate, neck and so on. The documents in this cluster are from all the categories except virus diseases (C02) and diseases of environmental origin (C21). Cluster 3 includes documents about diseases related to hormone secretion and distribution. This cluster also includes diseases of the bones and blood, since these concepts are closely related. Cluster 4 contains documents about diseases of the ear, nose, throat, head and surrounding areas of the face. Cluster 5 is the smallest cluster, and its documents concentrate on different types of tuberculosis and sexually transmitted diseases like AIDS and HPV. Documents about ‘cryptococcosis’, which is often seen in patients with HIV whose immunity has been lowered, also fall in this cluster. Cluster 6 consists of documents with concepts relating to diabetes. Documents containing concepts like ‘nephropathy’, ‘impaired glucose tolerance’ and ‘non-insulin dependent diabetes’ are in the left half of the cluster, whereas the right half of the cluster is dominated by documents with concepts such as ‘Crohn’s disease’, ‘renal ulceration’, and ‘kidney stone’. Cluster 8 has documents about diseases related to different heart conditions and obstruction of the flow of blood. Since the theme of the documents in the Ohsumed collection is cardiovascular concepts, this cluster has documents from all of the categories except parasitic diseases (C03), neoplasms (C04) and digestive system diseases (C06).
It is worth noting that only concepts of diseases are extracted from both data sets and used for document clustering. Because the original category labels of the Ohsumed collection might not have been assigned based on the concepts of diseases, these labels are not used to evaluate the clustering performance.
Overall, the proposed document clustering and visualization framework works well on both data sets, even though the Ohsumed collection has many more concepts of diseases, the majority of which have very low document frequency. On the U-matrix, the colors of the neurons surrounding the clusters show how separated these clusters are. The darker the color, the more separated, and thus the less related, the clusters are. The clusters on the U-matrix of PMC-OA appear to be more separated than those of the Ohsumed collection. One reason could be that all documents in the Ohsumed collection are related to cardiovascular diseases, so the clusters lie closer together on the U-matrix.
8. Conclusion and Future Work
In this paper, a biomedical document clustering framework based on concepts of diseases is proposed. The concepts of diseases are identified by using UMLS MetaMap. Instead of using an existing ontology to generate concept representation, the concepts are represented by using vectors based on a combination of TF-IDF and Word2Vec models. The proposed similarity measure is based on the vector representations of the concepts and shows that closely associated concepts of diseases have higher similarity scores than others. A representation of documents that considers the local content and semantic similarity between the concepts within the documents is used. A weighting scheme using TF-IDF combined with similarity score between the concepts is proposed. Instead of focusing on clustering performance evaluation, clustering visualization is explored in this research. Self-Organizing Map is a clustering algorithm that provides a visualization aid to understand the clusters and distribution of the clusters, and is thus used in this research. The results show that the clustering occurs along concepts of similar nature, of similar area and organs of the body, and concepts which are synonymous to one another. Nearby clusters are related in most cases, as well. This kind of visualization will help researchers explore related articles based on concepts of diseases.
Potential future work includes visualizing clusters of larger corpora by using a hierarchical clustering architecture, evaluating this visualization aid for the task of biomedical document search and extending this framework to biomedical document clustering based on concepts of symptoms and treatments.
- Met ([n. d.]a) [n. d.]a. Fact Sheet - UMLS Metathesaurus. https://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html.
- Met ([n. d.]b) [n. d.]b. MetaMap - A Tool For Recognizing UMLS Concepts in Text. https://metamap.nlm.nih.gov/.
- PMC ([n. d.]) [n. d.]. Open Access Subset. https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.
- SCT ([n. d.]) [n. d.]. SNOMED CT. https://www.nlm.nih.gov/healthit/snomedct/.
- ohs ([n. d.]) [n. d.]. Text Categorization Corpora. http://disi.unitn.it/moschitti/corpora.htm.
- Bloehdorn et al. (2006) Stephan Bloehdorn, Philipp Cimiano, and Andreas Hotho. 2006. Learning ontologies to improve text clustering and classification. In From data and information analysis to knowledge engineering. Springer, 334–341.
- Görg et al. (2010) Carsten Görg, Hannah Tipney, Karin Verspoor, William A Baumgartner Jr, K Bretonnel Cohen, John Stasko, and Lawrence E Hunter. 2010. Visualization and language processing for supporting analysis across the biomedical literature. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, 420–429.
- Gu et al. (2013) Jun Gu, Wei Feng, Jia Zeng, Hiroshi Mamitsuka, and Shanfeng Zhu. 2013. Efficient semisupervised MEDLINE document clustering with MeSH-semantic and global-content constraints. IEEE transactions on cybernetics 43, 4 (2013), 1265–1276.
- Knappe et al. (2007) Rasmus Knappe, Henrik Bulskov, and Troels Andreasen. 2007. Perspectives on ontology-based querying. International Journal of Intelligent Systems 22, 7 (2007), 739–761.
- Kohonen (1998) Teuvo Kohonen. 1998. The self-organizing map. Neurocomputing 21, 1 (1998), 1–6.
- Kohonen et al. (2000) Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojarvi, Jukka Honkela, Vesa Paatero, and Antti Saarela. 2000. Self organization of a massive document collection. IEEE transactions on neural networks 11, 3 (2000), 574–585.
- Logeswari and Premalatha (2013) S Logeswari and K Premalatha. 2013. Biomedical document clustering using ontology based concept weight. In Computer Communication and Informatics (ICCCI), 2013 International Conference on. IEEE, 1–4.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
- Moen and Ananiadou (2013) SPFGH Moen, Tapio Salakoski, and Sophia Ananiadou. 2013. Distributional semantics resources for biomedical text processing. (2013).
- Resnik et al. (1999) Philip Resnik et al. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res.(JAIR) 11 (1999), 95–130.
- Simeon and Hilderman (2008) Mondelle Simeon and Robert Hilderman. 2008. Categorical proportional difference: A feature selection method for text categorization. In Proceedings of the 7th Australasian Data Mining Conference-Volume 87. Australian Computer Society, Inc., 201–208.
- Wu and Palmer (1994) Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 133–138.
- Yoo and Hu (2006) Illhoi Yoo and Xiaohua Hu. 2006. A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE. In Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries. ACM, 220–229.
- Yoo and Song (2006) I. Yoo, X. Hu, and I.-Y. Song. 2006. A coherent graph-based semantic clustering and summarization approach for biomedical literature and a new summarization evaluation method. In First International Workshop on Text Mining in Bioinformatics Proceedings. 84–89. https://doi.org/10.1186/1471-2105-8-s9-s4
- Zhang et al. (2007) Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, and Xiaohua Zhou. 2007. A comparative study of ontology based term similarity measures on PubMed document clustering. Advances in Databases: Concepts, Systems and Applications (2007), 115–126.
- Zhu et al. (2009) Shanfeng Zhu, Jia Zeng, and Hiroshi Mamitsuka. 2009. Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics 25, 15 (2009), 1944–1951.