Topic Modeling for Classification of Clinical Reports

06/19/2017 ∙ by Efsun Sarioglu Kayi, et al.

Electronic health records (EHRs) contain important clinical information about patients. Efficient and effective use of this information could supplement or even replace manual chart review as a means of studying and improving the quality and safety of healthcare delivery. However, some of these clinical data are in the form of free text and require pre-processing before use in automated systems. A common free text data source is radiology reports, typically dictated by radiologists to explain their interpretations. We sought to demonstrate machine learning classification of computed tomography (CT) imaging reports into binary outcomes, i.e. positive and negative for fracture, using regular text classification and classifiers based on topic modeling. Topic modeling provides interpretable themes (topic distributions) in reports, a representation that is more compact than the commonly used bag-of-words representation and can be processed faster than raw text in subsequent automated processes. We demonstrate new classifiers based on this topic modeling representation of the reports. The aggregate topic classifier (ATC) and the confidence-based topic classifier (CTC) each use a single topic, determined from the training dataset using different measures, to classify the reports in the test dataset. Alternatively, the similarity-based topic classifier (STC) measures the similarity between the reports' topic distributions to determine the predicted class. Our proposed topic modeling-based classifier systems are shown to be competitive with existing text classification techniques and provide an efficient and interpretable representation.




1 Introduction

Large amounts of clinically important medical data are now stored in electronic health records (EHRs). In addition to simple performance measurements, more advanced uses may include decision support, such as matching prior patient patterns to recommend the need for a certain medical test or therapy. This can improve effectiveness and efficiency by helping the clinician avoid unnecessary or potentially harmful tests or therapies. However, some of these data are in the form of free text and they need to be processed and coded for better retrieval and analysis by automated or semi-automated systems.

Topic modeling is an unsupervised technique that can automatically identify themes in a given set of documents and find the topic distribution of each document. Representing reports by their topic distributions is more compact than raw text, so the reports can be processed faster in subsequent automated processing. Also, biomedical concepts can be well represented as nouns [1] and, compared to other parts of speech, nouns tend to specialize better into topics [2]. Accordingly, we hypothesized that a topic model representation of patient CT reports consisting of nouns would perform favorably compared to conventional machine learning for automated classification of clinical outcomes.

A preliminary version of this work has been reported in [3, 4]. In [3], the performance of topic vector classification with conventional classifiers was analyzed, and in [4], the aggregate topic classifier (ATC) was introduced using a single dataset. In this study, we introduce two new classifiers, namely the similarity-based and confidence-based topic classifiers (STC, CTC), and analyze and compare their performance more thoroughly using two datasets.

2 Material and Methods

Before turning to the results and findings, this section provides the technical background for this research: topic modeling and text classification. For topic modeling, we review the historical progress in the field by explaining the most widely used models and how they differ from each other in Section 2.1. After that, two popular classification techniques, namely SVM and decision tree, are explained in Section 2.2.


2.1 Topic Modeling

Topic modeling is an unsupervised learning technique that can automatically discover the themes of a document collection. Several techniques can be used for this purpose, including Latent Semantic Analysis (LSA) [5], Probabilistic Latent Semantic Analysis (PLSA) [6], and Latent Dirichlet Allocation (LDA) [7]. LSA is a way of representing the hidden semantic structure of a term-document matrix, in which rows are documents and columns are words/tokens [5], based on Singular Value Decomposition (SVD). One limitation of LSA is that each word is represented as a single point with the same meaning; therefore, in this representation, the different senses of a polysemous word cannot be differentiated. Also, the final output of LSA, which consists of axes in Euclidean space, is not interpretable or descriptive.


PLSA is considered to be a probabilistic version of LSA in which an unobserved class variable is associated with each occurrence of a word in a particular document [6]. These classes/topics are then inferred from the input text collection. PLSA solves the polysemy problem; however, it is not considered a fully generative model of documents, which can lead to overfitting [7].

LDA, first defined by Blei et al. [7], defines a topic as a distribution over a fixed vocabulary, where each document can exhibit topics in different proportions. LDA performs better than PLSA on small datasets because it avoids overfitting, and it also supports polysemy [7]. In contrast to PLSA, LDA is considered a fully generative model of documents. Accordingly, LDA is used to generate the topic distributions of clinical reports in this study.
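The generative view of LDA described above can be illustrated with a small sketch. It only simulates how a document would be generated from known topics; it does not perform inference and is not the toolbox used in this study. The vocabulary, the two topics, the Dirichlet parameter `alpha` and all function names are hypothetical and chosen purely for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)       # per-document topic proportions
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)           # choose a topic for this word
        w = sample_index(topics[z], rng)       # choose a word from that topic
        words.append(vocab[w])
    return theta, words


# Hypothetical toy vocabulary and hand-made topics (not taken from the datasets).
vocab = ["fracture", "orbit", "wall", "hemorrhage", "ventricle", "midline"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a fracture-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a brain-like topic
]

rng = random.Random(0)
theta, words = generate_document(topics, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic proportions:", theta)
print("generated words:", words)
```

The vector `theta` plays the role of the per-report topic distribution that the classifiers in this study operate on.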

2.2 Text Classification

Text classification is a supervised machine learning task in which each document’s category is learned from a pre-labeled set of documents. Decision trees and support vector machines (SVM) are two commonly used classification algorithms. In a decision tree, internal nodes are the selected terms from the vocabulary, the branches are the criteria on the weight of the terms and the leaves represent the classes. SVM, on the other hand, attempts to find a decision boundary between classes that is the farthest from any point in the training dataset. Given labeled training data $D = \{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$, it tries to find a separating hyperplane with the maximum margin [9]. In this study, decision tree and SVM were chosen as classification techniques: decision tree is preferred for its explicit rule-based output, which can be easily evaluated for content validity, and SVM performs well in text classification tasks [10, 11].
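To make the notion of a maximum-margin hyperplane concrete, the sketch below computes the geometric margin of two candidate hyperplanes on a toy dataset. It does not train an SVM; the data points, the candidate hyperplanes and the function names are illustrative assumptions. An SVM would search for the hyperplane that maximizes this quantity.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, data):
    """Smallest signed distance from any labeled point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in data)


# Hypothetical two-dimensional points with labels in {-1, 1}.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

margin_a = geometric_margin((1.0, 1.0), -2.0, data)  # hyperplane x1 + x2 = 2
margin_b = geometric_margin((1.0, 0.0), -1.0, data)  # hyperplane x1 = 1
print(margin_a, margin_b)
```

Both hyperplanes separate the toy data, but the first leaves a wider gap to the closest points, which is the property the SVM objective rewards.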

3 Calculation

Our proposed text classification techniques can be used for various domains. However, our main goal for this study was to utilize such techniques for effective classification of clinical reports. As such, radiology reports from various emergency medicine departments were used to evaluate the proposed classifiers’ performance. The datasets are computed tomography (CT) imaging reports done for head traumas, and they are further explained in Section 3.1. The preprocessing that they go through before any classification is explained in Section 3.3, and the measures that are used to evaluate the performance of these classifiers are explained in Section 3.2. Finally, after explaining the raw text classification of these clinical reports in Section 3.4, the proposed topic modeling-based classifiers are explained in Section 3.5.

3.1 Dataset

This study used prospectively collected patient CT report data previously collected for derivation of a traumatic orbital fracture clinical risk score [12] and a pediatric traumatic brain injury clinical prediction rule [13]. Staff radiologists dictated each CT report and the outcome of interest (either acute orbital fracture or findings consistent with traumatic brain injury) was extracted by a trained data abstractor. Among the 3,705 orbital CT reports, 3,242 were negative and 463 were positive. Among the 2,126 pediatric head CT reports, 1,973 were negative and 153 were positive. Figures 1 and 2 show sample reports from the orbital and pediatric datasets respectively.

Figure 1: Sample orbital CT report
Figure 2: Sample pediatric CT report

3.2 Evaluation

In this section, the measures used to evaluate the classification algorithms are explained. Once a classifier is built, its performance is evaluated on a separate dataset. To guard against overfitting, only a subset of the dataset, called the training dataset, was used to train the classifier. Its effectiveness was then measured on the remaining unseen documents in the testing set. Also, to effectively measure a classifier’s success, training and testing datasets with different proportions were prepared: 75%, 66%, 50%, 34%, and 25%. These training and test datasets were randomized and stratified to make sure each subset is a good representation of the original dataset in terms of class distribution. The orbital dataset has a positive class ratio of 12.5% and the pediatric dataset has a positive class ratio of 7.2%. To evaluate the classification performance, precision, recall, and F-score measures were used. For binary classification, possible cases are summarized in Table 1, and Equations 1 and 2 present how precision and recall are calculated.

                          Predicted class
                          Positive               Negative
Actual class   Positive   True Positive (TP)     False Negative (FN)
               Negative   False Positive (FP)    True Negative (TN)
Table 1: Confusion matrix

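Precision and recall are calculated from the counts in Table 1 as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

F-score is calculated as an equally weighted harmonic mean of precision and recall (see Equation 3):

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$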
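As a worked illustration of these measures, the following sketch computes precision, recall and F-score from confusion matrix counts. The counts are hypothetical and are not results from this study; the function names are chosen for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p, r):
    """Equally weighted harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion matrix counts for a skewed binary problem.
tp, fp, fn, tn = 40, 10, 20, 930

p = precision(tp, fp)
r = recall(tp, fn)
f = f_score(p, r)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

Note that the true negative count does not enter any of the three measures, which is why they are informative for datasets with a small positive class such as the ones used here.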


3.3 Preprocessing

Text data must be converted to a suitable format for automated processing. One common way of doing this is the bag-of-words (BoW) representation, where each document becomes a vector of its words/tokens and the collection becomes a document-term matrix. The entries in this matrix can be binary, stating the presence or absence of a word in a document, or they can be weighted according to the number of times a word occurs in a document. For this study, using term weights produced slightly better classification results than other options. Frequent words were also removed from the vocabulary to limit its size. In addition, these frequent words typically do not add much information; most were stop words such as is, am, are, the, of, at, and. Other preprocessing tasks such as stemming were also explored; however, they did not have a significant effect on the classification performance.
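A minimal sketch of this kind of preprocessing is shown below. It builds a vocabulary, drops words that occur in too many documents, and produces term-count vectors. The tokenizer, the document-frequency cutoff `max_doc_fraction` and the sample sentences are assumptions made for illustration; the actual cutoffs used in the study are not stated here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, max_doc_fraction=0.8):
    """Keep words that appear in at most max_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    limit = max_doc_fraction * len(docs)
    return sorted(word for word, df in doc_freq.items() if df <= limit)


def bag_of_words(doc, vocab):
    """Represent a document as term counts over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


# Hypothetical report fragments, not taken from the datasets.
docs = [
    "the orbit is intact",
    "fracture of the orbit floor",
    "the fracture of the left wall",
    "the sinuses are clear",
]

vocab = build_vocabulary(docs, max_doc_fraction=0.6)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors[2])
```

With this cutoff the word "the", which occurs in every sample document, is removed from the vocabulary, mirroring the removal of frequent stop words described above.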

3.4 Raw Text Classification of Clinical Reports

The raw text of clinical reports was classified by conventional classification techniques as shown in Figure 3. After preprocessing, the raw text files were combined with their associated outcomes and classified using SVM and decision tree in Weka. Weka is a collection of machine learning algorithms for data mining tasks written in Java [14].

Figure 3: System overview for raw text classification

3.5 Topic Modeling-based Classification of Clinical Reports

Clinical reports were classified by topic modeling-based classification techniques as shown in Figure 4. As discussed in Section 2.1, we chose LDA to generate the topic models of clinical reports because it is a generative probabilistic model of documents and it is robust to overfitting. The Stanford Topic Modeling Toolbox (TMT) [15] was used to conduct the experiments. It is open source software that provides ways to train and infer topic models for text data.

Figure 4: System overview for topic modeling-based classification

3.5.1 Topic Vector Classifier

In the topic vector classifier, a topic model of all of the reports was built and the topic distribution of each report was used to represent it in the form of a topic vector. This can be considered an alternative representation to bag-of-words (BoW), in which terms are replaced with topics and the entries for each report show the probability of a specific topic for that report. Representation as topic vectors is more compact than BoW because the vocabulary of a text collection usually has thousands of entries, whereas a topic model is typically built with a maximum of hundreds of topics. These topic vectors were then classified via conventional classification algorithms, e.g., SVM and decision tree (see Algorithm 1).


learn the topic model for the documents
merge the documents in topic vector representation with their classes
train decision tree and SVM using documents represented as topic vectors
Algorithm 1 Topic Vector Classifier

3.5.2 Confidence-based Topic Classifier (CTC)

In this classifier, after the topic model is learned, the single topic with the highest confidence [16] for a class is chosen. The confidence (Equation 4) of a topic X for a class Y is calculated as the support (5) of the topic and the class together divided by the support of the topic itself:

$$\text{confidence}(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)} \tag{4}$$

Using this topic, the predictions for the test dataset are made as shown in Algorithm 2.

learn the topic model for the documents in the training dataset
merge the documents in topic vector representation with their classes
calculate the confidence for each topic T and class C in the training dataset
find the topic t with the highest confidence for the positive class
pick a threshold th for the chosen topic t
for all documents in the testing dataset do
     infer the document’s topic distribution
     find its value v for the chosen topic t
     if v > th then
         predict as positive
     else
         predict as negative
     end if
end for
Algorithm 2 Confidence-based Topic Classifier (CTC)
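The steps of Algorithm 2 can be sketched as follows, under one important assumption that is not specified in the text: a topic is counted as "present" in a report when its proportion is at least a cutoff `presence`. With that assumption, the confidence of a topic for a class is the fraction of reports containing the topic that belong to the class. The cutoff, the toy topic vectors, the threshold value and all function names are hypothetical.

```python
def topic_confidence(topic_vectors, labels, topic, target, presence=0.5):
    """confidence(topic => target) = support(topic and target) / support(topic).

    Assumption: a topic is 'present' in a document when its proportion
    is at least `presence`; the common 1/N factors of the supports cancel.
    """
    present = [lab for vec, lab in zip(topic_vectors, labels) if vec[topic] >= presence]
    if not present:
        return 0.0
    return sum(1 for lab in present if lab == target) / len(present)


def choose_topic(topic_vectors, labels, target=1, presence=0.5):
    """Pick the topic with the highest confidence for the target class."""
    n_topics = len(topic_vectors[0])
    return max(
        range(n_topics),
        key=lambda t: topic_confidence(topic_vectors, labels, t, target, presence),
    )


def ctc_predict(vec, topic, threshold):
    """Predict positive (1) when the chosen topic's value exceeds the threshold."""
    return 1 if vec[topic] > threshold else 0


# Hypothetical training topic vectors (3 topics) and binary labels.
train_vectors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
train_labels = [1, 1, 0, 0, 0]

chosen = choose_topic(train_vectors, train_labels)
print("chosen topic:", chosen)
print("prediction:", ctc_predict([0.55, 0.30, 0.15], chosen, threshold=0.4))
```

In practice the threshold would be tuned, for example on a validation set, rather than fixed by hand as in this sketch.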

3.5.3 Similarity-based Topic Classifier (STC)

In this classifier, the topic model was learned on the training dataset and the average of the topic distributions for each class was calculated. For a document in the testing dataset, its topic distribution was inferred and the class that was the most similar to it was assigned as its predicted class (see Algorithm 3). To calculate the similarity, the cosine measure was used. Given two vectors x and y, the cosine of the angle between them can be calculated as in Equation 6:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \tag{6}$$

For non-negative vectors such as topic distributions, its value ranges between 0 and 1, and the more similar the vectors, the higher the cosine score. In this case, one vector represents the average topic distribution for a given class and the other vector represents the topic distribution of a test document.

learn the topic model for the documents in the training dataset
merge the documents in topic vector representation with their classes
calculate the average topic distribution of each class
for all documents in the testing dataset do
     infer the document’s topic distribution
     for all classes do
         calculate the similarity between the document’s topic distribution and average topic distribution of the class
     end for
     assign the class that is most similar to the document as predicted class
end for
Algorithm 3 Similarity-based Topic Classifier (STC)
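A minimal sketch of Algorithm 3 is given below, assuming the topic vectors have already been inferred by a topic model. The toy vectors, labels and function names are hypothetical and only illustrate the averaging and cosine comparison steps.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def class_averages(topic_vectors, labels):
    """Average topic distribution of each class in the training data."""
    averages = {}
    for label in set(labels):
        rows = [vec for vec, lab in zip(topic_vectors, labels) if lab == label]
        averages[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return averages


def stc_predict(vec, averages):
    """Assign the class whose average topic distribution is most similar."""
    return max(averages, key=lambda label: cosine(vec, averages[label]))


# Hypothetical training topic vectors (3 topics) and binary labels.
train_vectors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
train_labels = [1, 1, 0, 0, 0]

averages = class_averages(train_vectors, train_labels)
print(stc_predict([0.6, 0.2, 0.2], averages))
print(stc_predict([0.1, 0.7, 0.2], averages))
```

Unlike the single-topic classifiers, this comparison uses every entry of the topic vector, so it needs no threshold.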

3.5.4 Aggregate Topic Classifier (ATC)

With this approach, a representative topic vector for each class was composed by averaging the corresponding topic distributions in the training dataset. A discriminative topic was then chosen so that the difference between the positive and negative representative vectors is maximum, as shown in Algorithm 4. The reports in the test datasets were then classified by analyzing the values of this topic, and a threshold was chosen to determine the predicted class. This threshold can be chosen automatically based on class distributions if the dataset is skewed, or cross validation methods can be applied to pick a threshold that gives the best classification performance in a validation dataset. This approach is called Aggregate Topic Classifier (ATC) since training labels were utilized in an aggregate fashion using an average function rather than individually.

learn the topic model for the documents in the training dataset
merge the documents in topic vector representation with their classes
calculate the average of topic distributions of each class
pick the topic t whose difference between the average of classes is maximum
pick a threshold th on the selected topic t
for all documents in the testing dataset do
     infer the document’s topic distribution
     find its value v for the chosen topic t
     if v > th then
         predict as positive
     else
         predict as negative
     end if
end for
Algorithm 4 Aggregate Topic Classifier (ATC)
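The steps of Algorithm 4 can be sketched as follows, assuming topic vectors are already available. Two choices in this sketch are assumptions rather than details from the text: the difference is taken as positive mean minus negative mean (so that larger values indicate the positive class), and the threshold is set to the midpoint of the two class means on the chosen topic as a simple placeholder for the class-distribution or cross validation based choice. The toy data and function names are hypothetical.

```python
def class_mean(topic_vectors, labels, target):
    """Average topic distribution of the documents in one class."""
    rows = [vec for vec, lab in zip(topic_vectors, labels) if lab == target]
    return [sum(col) / len(rows) for col in zip(*rows)]


def pick_topic(pos_mean, neg_mean):
    """Topic with the largest (positive minus negative) difference in class means."""
    return max(range(len(pos_mean)), key=lambda t: pos_mean[t] - neg_mean[t])


def atc_predict(vec, topic, threshold):
    """Predict positive (1) when the chosen topic's value exceeds the threshold."""
    return 1 if vec[topic] > threshold else 0


# Hypothetical training topic vectors (3 topics) and binary labels.
train_vectors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
train_labels = [1, 1, 0, 0, 0]

pos_mean = class_mean(train_vectors, train_labels, 1)
neg_mean = class_mean(train_vectors, train_labels, 0)
topic = pick_topic(pos_mean, neg_mean)

# Placeholder threshold: midpoint between the class means on the chosen topic.
threshold = (pos_mean[topic] + neg_mean[topic]) / 2

print("topic:", topic, "threshold:", round(threshold, 3))
print("prediction:", atc_predict([0.5, 0.3, 0.2], topic, threshold))
```

Because prediction only reads one entry of the topic vector and compares it to a number, classification at test time is very cheap once the topic model has been built.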

4 Results

The main goal of this study is to analyze and optimize clinical text classification. As a starting point, the raw text of clinical reports was classified by well-known conventional classification algorithms. Alternatively, topic modeling of the corpora was used as a compact representation of the clinical reports, and classifiers were built using this representation in various ways. The classification results using the proposed topic model-based classifiers are presented according to the evaluation techniques explained in Section 3.2.

4.1 Raw Text Classification Results

The raw text of clinical reports was preprocessed and classified using decision tree (DT) and SVM. The results are illustrated in Figures 5 and 6 for the orbital and pediatric datasets respectively. SVM consistently performs better than decision tree across the different training and testing proportions and for both datasets.

Figure 5: Raw text classification performance for the orbital dataset: (a) precision, (b) recall, (c) F-score
Figure 6: Raw text classification performance for the pediatric dataset: (a) precision, (b) recall, (c) F-score

4.2 Topic Modeling-based Classification Results

One of the advantages of switching from using the entire vocabulary to represent documents to using topics, as explained in Section 2.1, is the dimension reduction (Equation 7) achieved by this transformation, i.e., the relative decrease in the number of attributes when moving from a vocabulary of size $|V|$ to $K$ topics:

$$\text{dimension reduction} = \frac{|V| - K}{|V|} \tag{7}$$

Typically, the vocabulary of a text corpus has thousands of entries, whereas the total number of topics is usually in the lower hundreds. The orbital and pediatric datasets had 1,295 and 1,501 attributes respectively. These numbers reflect the total number of attributes after preprocessing such as removal of frequent and infrequent words. For topic numbers ranging from 5 to 150, a dimension reduction of 88% to 99% is achieved for the orbital dataset. Similarly, for the pediatric dataset, 90% to 99% dimension reduction is achieved.
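The reported ranges can be checked with a short calculation, assuming dimension reduction is measured as the relative decrease from the number of attributes to the number of topics. That definition is an assumption of this sketch; it is consistent with the attribute counts and percentages stated above.

```python
def dimension_reduction(n_attributes, n_topics):
    """Relative decrease in dimensionality when topics replace vocabulary terms."""
    return (n_attributes - n_topics) / n_attributes


# Attribute counts after preprocessing, as reported for the two datasets.
datasets = {"orbital": 1295, "pediatric": 1501}

for name, n_attributes in datasets.items():
    for n_topics in (5, 150):
        reduction = dimension_reduction(n_attributes, n_topics)
        print(f"{name}: {n_topics} topics -> {reduction:.1%} reduction")
```

The smallest topic count gives the largest reduction, so the 5-topic models correspond to the upper end of each reported range and the 150-topic models to the lower end.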

Classification performance of ATC, STC and CTC was compared to SVM and decision tree in Figures 7 and 8 for the orbital and pediatric datasets respectively. Each figure is divided into five sections to show the results of using different training/testing proportions. These training and test datasets are randomized and stratified to make sure each subset is a good representation of the original dataset, as explained in Section 3.2. Also, since the best number of topics is not known in advance, different values were considered, ranging from 5 to 150. Among all techniques, topic vector classification with SVM performed the best, especially with a higher number of topics. However, for a smaller number of topics, ATC and topic vector classification with decision tree performed better or comparably, depending on the training dataset size. Better performance with a lower number of topics is desirable because it leads to faster training and testing times. CTC and STC showed varying success depending on the number of topics and training dataset size; STC improved as the number of topics increased, since it uses the entire topic vector for classification. ATC and CTC, on the other hand, did not improve as much with an increasing number of topics, since they use a single discriminative topic. Finally, different training and testing proportions had little effect on the classifiers’ performance for both datasets. This suggests that the classifiers generalize well and that a small portion of the data would be sufficient to train an accurate classifier. This is a useful outcome because large labeled datasets are typically difficult to obtain, as the labeling process is costly.

Figure 7: Classification performance using ATC, STC, CTC, DT and SVM for the orbital dataset: (a) precision, (b) recall, (c) F-score
Figure 8: Classification performance using ATC, STC, CTC, DT and SVM for the pediatric dataset: (a) precision, (b) recall, (c) F-score

To summarize, raw text classification using both decision tree and SVM performed well, with SVM performing better than decision tree for both datasets. Alternatively, when the topic vector representation of the reports was used, the classification performance improved for both datasets. Between decision tree and SVM, SVM performed better for topic vector classification as well. Among the topic modeling-based classifiers, ATC performed the best for both datasets. ATC also performed better than raw text classification but not better than topic vector classification using SVM. Since ATC is a simpler algorithm than SVM, it may be preferable to use ATC once the topic model is built.

5 Discussion

Beyond standard topic modeling techniques, there have been studies that further enhance the capabilities of topic modeling. In [17], Wallach extended the LDA algorithm to handle n-grams. Griffiths et al. [2] combined LDA with POS tagging to model both content and function words. These studies resulted in more complex topic modeling algorithms, mostly to make topic modeling features comparable to those of Natural Language Processing (NLP), which can slow down the system. Other NLP-based classification techniques, e.g., [18, 19], can be effective in classifying clinical reports as well; however, they are computationally expensive and may require customization by medical experts. As such, we wanted to build a fast and efficient solution using topic modeling without increasing the algorithmic complexity or the time to generate topic models. Accordingly, our solutions are based on standard topic modeling algorithms, extended with our classification techniques.

In the field of text classification, topic modeling techniques have been used in various ways. Zhang et al. [20] used topic modeling as a keyword selection mechanism by selecting the top words from topics based on their entropy. In our study, we removed the most frequent and infrequent words to produce a manageable vocabulary size, but we did not use topic model output as a keyword selection mechanism. Sriurai [21] compared the BoW representation to a topic model representation for classification. This is similar to our topic vector classification results with SVM. However, because the number of topics typically is not known in advance, we evaluated different numbers of topics, whereas Sriurai [21] used a fixed number of topics. In another similar study, Banerjee [22] used topics as additional features alongside BoW features for the purpose of classification. In our approaches, we used the topic vector representation as an alternative to the BoW representation and not as additional features. This way, we can achieve greater dimension reduction.

Other than text classification, topic modeling techniques have also been used in related tasks. Arnold et al. [23] present an information retrieval system in which patients can be queried and compared based on their topic distributions. We also used a similarity measure to compare a report with a class-representative topic distribution; however, our use is not query-based and serves classification purposes.

6 Conclusion

In this study, topic modeling of clinical reports was used with different classification techniques, and the automated classification of clinical outcomes was compared with conventional machine learning techniques. Compared to the bag-of-words representation, classification using topic vectors performed comparably, with the additional benefits of dimension reduction and interpretability. Several supervised classifiers were built based on the topic model of the documents in the training dataset. In the confidence-based topic classifier (CTC), the topic with the highest confidence for the positive class was used to classify reports in the testing dataset. Alternatively, in the similarity-based topic classifier (STC), a document was classified by comparing its topic distribution to the average topic distribution of each class. Finally, in the aggregate topic classifier (ATC), a single discriminative topic was chosen and used to classify the reports in the testing dataset. Among these topic modeling-based classifiers, ATC demonstrated the best classification performance; however, topic vector classification using SVM was the most successful among all classifiers. Since ATC uses fewer topics and is less complex than SVM, it may still be preferable to use ATC for faster performance with comparable accuracy.

Results from this study can have significant impacts on the quality and efficiency of healthcare. First of all, the classifiers built in this study can be used to automatically predict the conditions in a clinical report. They can supplement or even replace the manual review of clinical reports, which can be time consuming and error-prone. In addition, with the accuracy and interpretability they provide, clinicians can have more confidence in utilizing such systems in real-life settings. Finally, real-world datasets such as the ones used in this study can be more challenging than simulated ones. There could be human errors during manual labeling, or physicians may disagree. Therefore, it is critical to achieve good performance on real-world datasets so that such systems are viable in real-world settings. Our proposed classifiers provide promising results for successful use in such settings.


References

  • [1] Y. Huang, H. J. Lowe, D. Klein, R. J. Cucina, Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon, J Am Med Inform Assoc 12 (3) (2005) 275–285.
  • [2] T. L. Griffiths, M. Steyvers, D. M. Blei, J. B. Tenenbaum, Integrating Topics and Syntax, in: NIPS, 2004, pp. 537–544.
  • [3] E. Sarioglu, K. Yadav, H.-A. Choi, Clinical Report Classification Using Natural Language Processing and Topic Modeling, in: 11th International Conference on Machine Learning and Applications (ICMLA), 2012, pp. 204–209.
  • [4] E. Sarioglu, K. Yadav, H.-A. Choi, Topic Modeling Based Classification of Clinical Reports, in: 51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop, 2013, pp. 67–73.
  • [5] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41 (6) (1990) 391–407.
  • [6] T. Hofmann, Probabilistic Latent Semantic Analysis, in: UAI, 1999.
  • [7] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
  • [8] T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Mach. Learn. 42 (1-2) (2001) 177–196.
  • [9] J. C. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines (1998).
  • [10] T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, in: Proceedings of the 10th European Conference on Machine Learning, 1998, pp. 137–142.
  • [11] Y. Yang, X. Liu, A Re-examination of Text Categorization Methods, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42–49.
  • [12] K. Yadav, E. Cowan, J. S. Haukoos, Z. Ashwell, V. Nguyen, P. Gennis, S. P. Wall, Derivation of a Clinical Risk Score for Traumatic Orbital Fracture, J Trauma Acute Care Surg 73 (5) (2012) 1313–1318.
  • [13] N. Kuppermann, et al., Identification of Children at Very Low Risk of Clinically-Important Brain Injuries After Head Trauma: A Prospective Cohort Study, Lancet 374 (9696) (2009) 1160–1170.
  • [14] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
  • [15] Stanford Topic Modeling Toolbox (TMT) Home Page, (2015).
  • [16] R. Agrawal, T. Imieliński, A. Swami, Mining Association Rules Between Sets of Items in Large Databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp. 207–216.
  • [17] H. M. Wallach, Topic Modeling: Beyond Bag-of-Words, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 977–984.
  • [18] K. Yadav, E. Sarioglu, M. Smith, H.-A. Choi, Automated Outcome Classification of Emergency Department Computed Tomography Imaging Reports, Academic Emergency Medicine 20 (8) (2013) 848–854.
  • [19] K. Yadav, E. Sarioglu, H. Choi, W. B. Cartwright, P. S. Hinds, J. M. Chamberlain, et al., Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury, Academic emergency medicine.
  • [20] Z. Zhang, X.-H. Phan, S. Horiguchi, An Efficient Feature Selection Using Hidden Topic in Text Categorization, in: Proceedings of the 22nd International Conference on Advanced Information Networking and Applications - Workshops, 2008, pp. 1223–1228.
  • [21] W. Sriurai, Improving Text Categorization by Using a Topic Model, Advanced Computing: An International Journal (ACIJ) 2 (6).
  • [22] S. Banerjee, Improving Text Classification Accuracy Using Topic Modeling over an Additional Corpus, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 867–868.
  • [23] C. W. Arnold, S. M. El-Saden, A. A. T. Bui, R. Taira, Clinical Case-based Retrieval Using Latent Topic Analysis, AMIA Annu Symp Proc 2010 (2010) 26–30.